PARALLEL DATA LAB 

PDL Abstract

3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty

EuroSys ’18, April 23–26, 2018, Porto, Portugal. Supersedes CMU-PDL-17-107, Nov. 2017.

Jun Woo Park, Alexey Tumanov^, Angela Jiang, Michael A. Kozuch*, Gregory R. Ganger

Carnegie Mellon University
Pittsburgh, PA

*Intel Labs
^ UC Berkeley

http://www.pdl.cmu.edu/

The 3Sigma cluster scheduling system uses job runtime histories in a new way. Knowing how long each job will execute enables a scheduler to more effectively pack jobs with diverse time concerns (e.g., deadline vs. the-sooner-the-better) and placement preferences on heterogeneous cluster resources. However, existing schedulers use single-point estimates (e.g., the mean or median of a relevant subset of historical runtimes), and we show that they are fragile in the face of real-world estimate error profiles. In particular, analysis of job traces from three different large-scale cluster environments shows that, while the runtimes of many jobs can be predicted well, even state-of-the-art predictors have wide error profiles, with 8–23% of predictions off by a factor of two or more. Instead of reducing relevant history to a single point, 3Sigma schedules jobs based on full distributions of relevant runtime histories and explicitly creates plans that mitigate the effects of anticipated runtime uncertainty. Experiments with workloads derived from the same traces show that 3Sigma greatly outperforms a state-of-the-art scheduler that uses point estimates from a state-of-the-art predictor; in fact, the performance of 3Sigma approaches the end-to-end performance of a scheduler based on a hypothetical, perfect runtime predictor. 3Sigma reduces SLO miss rate, increases cluster goodput, and improves or matches latency for best-effort jobs.
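The contrast between a point estimate and a full runtime distribution can be illustrated with a small sketch. This is not 3Sigma's algorithm: the runtime values, the time budget, and the helper names `fraction_off_by_factor` and `prob_finishes_within` are all hypothetical, chosen only to show why a median can hide the risk that a distribution exposes.

```python
from statistics import median


def fraction_off_by_factor(predicted, actual, factor=2.0):
    """Fraction of predictions that miss the actual runtime by at least
    `factor` in either direction (over- or under-estimate).
    Assumes all runtimes are positive."""
    misses = sum(
        1 for p, a in zip(predicted, actual)
        if p >= factor * a or a >= factor * p
    )
    return misses / len(actual)


def prob_finishes_within(history, budget):
    """Empirical probability that a runtime drawn from `history`
    is at most `budget`."""
    return sum(1 for r in history if r <= budget) / len(history)


# Hypothetical historical runtimes (seconds) for one class of jobs:
# mostly short, with a heavy tail.
history = [50, 55, 60, 60, 65, 70, 200, 240]
budget = 100  # hypothetical time remaining before the job's deadline

# Point-estimate view: the median fits the budget, so the job looks safe.
point_estimate = median(history)
point_says_safe = point_estimate <= budget

# Distribution view: the same history shows a sizable chance of a miss.
p_on_time = prob_finishes_within(history, budget)

# Error profile of the point estimate against the observed runtimes.
error_rate = fraction_off_by_factor([point_estimate] * len(history), history)

print(f"median estimate: {point_estimate}s, looks safe: {point_says_safe}")
print(f"empirical P(on time): {p_on_time:.2f}")
print(f"predictions off by 2x or more: {error_rate:.0%}")
```

In this toy history the median suggests the job comfortably fits its budget, while the empirical distribution shows that a quarter of past runs would have missed it. A scheduler that sees the whole distribution can weigh that risk when placing the job; one that sees only the median cannot.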

FULL PAPER: pdf
TECHNICAL REPORT VERSION: pdf