Hadoop Workload Analysis

Contact: Kai Ren, Garth Gibson

We have analyzed Hadoop workloads from three research clusters, taking a user-centric perspective. The goal is to better understand how data scientists use the system and how well that use matches its design.

Overall, our analysis suggests that Hadoop usage is still in its adolescence. We do see good use of Hadoop: all workloads are dominated by data transformations that Hadoop handles well; users leverage Hadoop's ability to process massive-scale datasets; customizations are used in a visible fraction of jobs for correctness or performance reasons. However, we also find uses that go beyond what Hadoop has been designed to handle:

In summary, we find that users today make good use of their Hadoop clusters, but there is also significant room for improvement in how users interact with them:

OpenCloud Log Statistics (Total Users: 78)

Distribution of Job Structures and Application Frameworks in OpenCloud Logs







We thank N. Balasubramanian and M. Schmitz for helpful comments and discussions. We also thank the owners of the logs from the three Hadoop clusters for graciously sharing these logs with us. This research is supported in part by the Gordon and Betty Moore Foundation; the National Science Foundation under awards SCI-0430781 and CCF-1019104; the Qatar National Research Foundation (09-1116-1-172); DOE/Los Alamos National Laboratory under contract number DE-AC52-06NA25396/161465-1; and Intel as part of ISTC-CC.

We thank the members and companies of the PDL Consortium: Broadcom, Ltd., Citadel, EMC Corporation, Facebook, Google, Hewlett-Packard Labs, Hitachi Ltd., Intel Corporation, Microsoft Research, MongoDB, NetApp, Inc., Oracle Corporation, Samsung Information Systems America, Seagate Technology, Tintri, Two Sigma, Uber, Veritas and Western Digital for their interest, insights, feedback, and support.




© 2016. Last updated 3 September, 2013