Hadoop's Adolescence: A Comparative Workload Analysis from Three Research Clusters

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-12-106. June 2012.

Kai Ren, YongChul Kwon*, Magdalena Balazinska*, Bill Howe*

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

*University of Washington

http://www.pdl.cmu.edu/

We analyze Hadoop workloads from three research clusters from an application-level perspective, with two goals: (1) explore new issues in application patterns and user behavior and (2) understand key performance challenges related to I/O and load balance. Our analysis suggests that Hadoop usage is still in its adolescence. We see underuse of Hadoop features, extensions, and tools as well as significant opportunities for optimization. We see significant diversity in application styles, including some "interactive" workloads, motivating new tools in the ecosystem. We find that some conventional approaches to improving performance are not especially effective and suggest some alternatives. Overall, we find significant opportunity for simplifying the use and optimization of Hadoop, and make recommendations for future research.

KEYWORDS: Hadoop, Workload Analysis, User Behavior, Storage, Load Balance