PARALLEL DATA LAB 

PDL Abstract

Memory-Efficient Group-By-Aggregate using Compressed Buffer Trees

Georgia Tech Center for Experimental Research in Computer Systems Technical Report GIT-CERCS-12-08.

Hrishikesh Amur^, Wolfgang Richter, David G. Andersen, Michael Kaminsky^, Karsten Schwan^, Athula Balachandran, Erik Zawadzki

Carnegie Mellon University
Pittsburgh, PA 15213

*Intel Labs
^Georgia Institute of Technology

http://www.pdl.cmu.edu/

Memory is rapidly becoming a precious resource in many data processing environments. This paper introduces a new data structure called a Compressed Buffer Tree (CBT). Using a combination of buffering, compression, and lazy aggregation, CBTs can improve the memory efficiency of the GroupBy-Aggregate abstraction which forms the basis of many data processing models like MapReduce and databases. We evaluate CBTs in the context of MapReduce aggregation, and show that CBTs can provide significant advantages over existing hashbased aggregation techniques: up to 2X less memory and 1.5X the throughput, at the cost of 2.5X CPU.

FULL PAPER: pdf