PARALLEL DATA LAB 

PDL Abstract

This is Why ML-driven Cluster Scheduling Remains Widely Impractical

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-19-103, May 2019.

Michael Kuchnik¹†, Jun Woo Park¹†, Chuck Cranor†, Elisabeth Moore‡, Nathan DeBardeleben‡, George Amvrosiadis†

†Carnegie Mellon University
‡Los Alamos National Laboratory

http://www.pdl.cmu.edu/

Using learning algorithms in cluster schedulers has the potential to increase the effectiveness of a computing cluster, yet deployments remain limited. Our analysis of a diverse set of workload traces from national labs and industry points to three challenges that have received little attention in the literature. First, lack of data diversity can adversely affect the design of prediction systems. We demonstrate how we detected limitations in prior work, such as impractical feature engineering, unreliable prediction performance, and inconspicuous overfitting. Second, workload changes can negatively affect predictor performance. For the first time, we quantify how accuracy degrades over time due to the non-stationarity that is common across most traces. We further examine the effectiveness of adaptive techniques, such as online training, at addressing this problem. Third, aiming for high prediction accuracy alone does not guarantee dramatically better end-to-end performance. Our experimental results indicate that the effect of prediction error on end-to-end performance can be small for common cluster deployments.

KEYWORDS: Machine Learning, Cluster Scheduling, Cloud Computing, High Performance Computing

FULL TR: pdf

¹ Contributed equally.