PDL Abstract

Distribution-based Cluster Scheduling

Carnegie Mellon University School of Computer Science PhD Dissertation, June 2019.

Jun Woo Park

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Modern computing clusters support a mixture of diverse activities, ranging from customer-facing internet services and software development and test to scientific research and exploratory data analytics. Many schedulers exploit knowledge of pending jobs' runtimes and resource usages as a powerful building block, but suffer a significant performance penalty if such knowledge is imperfect. This dissertation demonstrates that schedulers that rely on information about job runtimes and resource usages can more robustly address imperfect predictions by looking at the likelihoods of possible outcomes rather than single-point expected outcomes.

This dissertation presents a workload analysis and two case studies of scheduling systems: 3Sigma and DistSched. Characterization of real workloads reveals inherent variability in job runtimes and resource usage that cannot be captured by single-point estimates. An evaluation of a history-based runtime predictor on four different traces demonstrates that it is not trivial to obtain perfect runtime predictions in real workloads, especially if the predictor is provided with insufficient information. 3Sigma is a scheduler that leverages distributions of the relevant runtime histories rather than just a point estimate derived from them. By leveraging distributions and mis-estimate mitigation mechanisms, 3Sigma is able to make more robust scheduling decisions and outperform state-of-the-art scheduling systems that rely on limited or no runtime knowledge. DistSched is a scheduler that leverages distributions of resource usage (CPU, memory, and CPU-time) and accounts for the risk of contention to make robust scheduling decisions. The evaluation of DistSched demonstrates that leveraging full history and mitigation mechanisms allows the scheduler to address imperfect predictions more robustly and perform almost as well as a hypothetical system equipped with perfect knowledge of runtime and resource usage.
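The difference between a point estimate and a distribution can be illustrated with a minimal sketch. This is not the 3Sigma or DistSched implementation; the function names, the use of the mean as the point estimate, and the runtime history below are all hypothetical, chosen only to show how a single number can hide a substantial risk of missing a deadline.

```python
from statistics import mean

def point_estimate_meets_deadline(history, start, deadline):
    """Point-estimate view: assume the job runs for exactly the mean past runtime."""
    return start + mean(history) <= deadline

def probability_meets_deadline(history, start, deadline):
    """Distribution view: fraction of past runtimes that would finish by the deadline."""
    budget = deadline - start
    return sum(1 for runtime in history if runtime <= budget) / len(history)

# Hypothetical bimodal runtime history (e.g. minutes) for similar past jobs.
history = [10, 12, 11, 60, 58, 9, 61, 10]

# The mean (28.875) fits within the deadline, so the point estimate says "safe".
print(point_estimate_meets_deadline(history, start=0, deadline=30))

# The empirical distribution shows that only 5 of 8 past runs would have finished in time.
print(probability_meets_deadline(history, start=0, deadline=30))
```

Under these assumptions the point estimate reports that the deadline is met, while the empirical distribution puts the chance of meeting it at 62.5%. A scheduler that sees the second number can weigh that risk when placing the job.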

KEYWORDS: Planning under uncertainty, Cluster Scheduling, Cloud Computing