PARALLEL DATA LAB 

PDL Abstract

An Analysis of Traces from a Production MapReduce Cluster

10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010). May 17-20, 2010, Melbourne, Victoria, Australia. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-09-107, December, 2009.

Soila Kavulya, Jiaqi Tan*, Rajeev Gandhi and Priya Narasimhan

School of Computer Science & Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213

* DSO National Laboratories, Singapore

http://www.pdl.cmu.edu/

MapReduce is a programming paradigm for parallel processing that is increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both the service providers in the cloud and users: the service provider can use this knowledge to make better scheduling decisions, while the user can learn what aspects of their jobs impact performance. This paper analyzes 10-months of MapReduce logs from the M45 supercomputing cluster which Yahoo! made freely available to select universities for systems research. We characterized resource utilization patterns, job patterns, and sources of failures. We use an instance-based learning technique that exploits temporal locality to predict job completion times from historical data and identify potential performance problems in our dataset.

KEYWRORDS: MapReduce, Workload characterization, Performance prediction

FULL PAPER: pdf
FULL TR: pdf