PARALLEL DATA LAB 

PDL Abstract

Memory-Efficient GroupBy-Aggregate using Compressed Buffer Trees

2013 ACM Symposium on Cloud Computing (SoCC'13), Oct. 01-03 2013, Santa Clara, CA, USA.

Hrishikesh Amur^, Wolfgang Richter, David G. Andersen, Michael Kaminsky*, Karsten Schwan^, Athula Balachandran, Erik Zawadzki

Carnegie Mellon University
Pittsburgh, PA 15213

*Intel Labs
^Georgia Institute of Technology

http://www.pdl.cmu.edu/

The rapid growth of fast analytics systems, that require data processing in memory, makes memory capacity an increasingly-precious resource. This paper introduces a new compressed data structure called a Compressed Buffer Tree (CBT). Using a combination of techniques including buffering, compression, and serialization, CBTs improve the memory efficiency and performance of the GroupBy-Aggregate abstraction that forms the basis of not only batch-processing models like MapReduce, but recent fast analytics systems too. For streaming workloads, aggregation using the CBT uses 21-42% less memory than using Google SparseHash with up to 16% better throughput. The CBT is also compared to batch-mode aggregators in MapReduce runtimes such as Phoenix++ and Metis and consumes 4X and 5X less memory with 1.5-2X and 3-4X more performance respectively.

FULL PAPER: pdf