DATE: Thursday, June 28, 2012
TIME: Noon - 1:15 pm

SPEAKER: Mohammad Hammoud, CMU Qatar

TITLE: Map Concurrency Characterization and Center-of-Gravity Task Scheduling for Improving MapReduce Performance on the Cloud

MapReduce is now a pervasive analytics engine on the cloud. Hadoop is an open source implementation of MapReduce and is currently enjoying wide popularity. In this work, we identify two design factors that affect Hadoop performance and investigate an alternative approach for each factor. First, Hadoop has a high-dimensional space of configuration parameters that poses a burden on practitioners, like computation scientists, system researchers, and business analysts, to set for efficient and cost-effective execution. We observe that MapReduce application performance is highly influenced by map concurrency, defined in terms of two configurable parameters, the number of available map slots and the number of map tasks running over the slots. We show that some inherent MapReduce characteristics allow systematic and well-informed prediction of MapReduce performance response (runtime increase or decrease) as map concurrency is varied. We propose Map Concurrency Characterization (MC2), a predictor for MapReduce performance response. MC2 allows guiding Map phase configuration and, consequently, improving Hadoop performance. Second, we realize that Hadoop neither exploits data locality nor addresses partitioning skew present in some MapReduce applications when it schedules reduce tasks. This might lead to increased cluster network traffic. We scrutinize the problems of data locality and partitioning skew in Hadoop and propose Center-of-Gravity Reduce Scheduler (CoGRS), a locality-aware skew-aware reduce task scheduler for saving MapReduce network traffic. In an attempt to exploit data locality, CoGRS schedules each reduce task at its center-of-gravity node, which is computed after considering partitioning skew as well. We implemented MC2 and CoGRS using Hadoop 0.20.2 and conducted comprehensive experiments on a private cloud and on Amazon EC2. Our results show that MC2 can correctly predict MapReduce performance response and provide up to 2.3X speedup in runtime. In addition, CoGRS minimizes off-rack network traffic by an up to 38.6%, which translates to a runtime reduction of 23.8% for the tested benchmarks.

Mohammad Hammoud is currently a postdoctoral research associate at Carnegie Mellon University (CMU) in Qatar. He received MS and PhD degrees in Computer Science from the University of Pittsburgh in 2010. Dr. Hammoud was a member in the Computer Architecture, Systems, and Technology Laboratory (now named Scalable Computing Group- XCG) at the University of Pittsburgh from 2005 to 2010. For 5 years he worked on architectural techniques to effectively manage caches in chip multiprocessors under the supervisions of Dr. Rami Melhem and Dr. Sangyeun Cho. His research interests include different areas in computer architecture and data intensive scalable computing. Currently, he is working on improving scheduling mechanisms for MapReduce applications and characterizing scientific applications to enable intelligent configuration of distributed analytics engines and effective provisioning of virtual resources on the cloud.

VISITOR HOST: Garth Gibson

VISITOR COORDINATOR: Jennifer Landefeld (

Karen Lindenfelser, 86716, or visit