PDL Abstract

Sia: Heterogeneity-aware, Goodput-optimized ML-cluster Scheduling

ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP ’23), October 23–26, 2023, Koblenz, Germany.

Suhas Jayaram Subramanya, Daiyaan Arfeen, Shouxu Lin‡, Aurick Qiao†, Zhihao Jia, and Gregory R. Ganger

Carnegie Mellon University
† Petuum, Inc.
‡ Cornell University

http://www.pdl.cmu.edu/

The Sia¹ scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to elastic resource-adaptive jobs. Although some recent schedulers address one aspect or another (e.g., heterogeneity or resource-adaptivity), none addresses both, and most scale poorly to large clusters and/or heavy workloads even without the full complexity of the combined scheduling problem. Sia introduces a new scheduling formulation that can scale to the resulting search-space sizes and intentionally match jobs and their configurations to GPU types and counts, while adapting to changes in cluster load and job mix over time. Sia also introduces a low-profiling-overhead approach to bootstrapping (for each new job) the throughput models used to evaluate possible resource assignments, and it is the first cluster scheduler to support elastic scaling of hybrid parallel jobs.
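
To make the formulation concrete, below is a minimal sketch (not Sia's actual code) of the kind of goodput-optimized, heterogeneity-aware assignment a scheduler like Sia solves each scheduling round, written as a 0/1 integer program. The job names, candidate configurations, goodput numbers, and the choice of the PuLP/CBC solver are illustrative assumptions; the real formulation additionally accounts for job adaptivity, restart costs, and throughput models bootstrapped online.

import pulp

# Cluster capacity per GPU type (illustrative numbers).
capacity = {"A100": 8, "V100": 16, "K80": 24}

# Candidate (gpu_type, gpu_count) configurations per job, each with an
# estimated goodput, normalized so different jobs are comparable.
candidates = {
    "job0": {("A100", 4): 1.00, ("V100", 8): 0.85, ("K80", 8): 0.30},
    "job1": {("A100", 2): 0.90, ("V100", 4): 0.80, ("K80", 16): 0.45},
    "job2": {("V100", 8): 0.95, ("K80", 8): 0.40},
}

prob = pulp.LpProblem("goodput_round", pulp.LpMaximize)

# x[j, cfg] = 1 if job j runs with configuration cfg this round.
x = {
    (j, cfg): pulp.LpVariable(f"x_{j}_{cfg[0]}_{cfg[1]}", cat="Binary")
    for j, cfgs in candidates.items()
    for cfg in cfgs
}

# Objective: maximize total normalized goodput across all jobs.
prob += pulp.lpSum(candidates[j][cfg] * var for (j, cfg), var in x.items())

# Each job receives at most one configuration (it may go unscheduled).
for j, cfgs in candidates.items():
    prob += pulp.lpSum(x[j, cfg] for cfg in cfgs) <= 1

# GPU demand per type must fit within cluster capacity.
for gpu_type, cap in capacity.items():
    prob += pulp.lpSum(
        cfg[1] * var for (j, cfg), var in x.items() if cfg[0] == gpu_type
    ) <= cap

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (j, cfg), var in x.items():
    if var.value() > 0.5:  # binary var; CBC returns 0.0/1.0
        print(f"{j} -> {cfg[1]}x {cfg[0]}")

Because each job exposes only a small set of candidate (GPU type, count) configurations, the decision space stays tractable as clusters and job counts grow, which is in the spirit of the scalability claim above.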

Extensive evaluations show that Sia outperforms state-of-the-art schedulers. For example, even on relatively small 44- to 64-GPU clusters with a mix of three GPU types, Sia reduces average job completion time (JCT) by 30–93%, 99th-percentile JCT and makespan by 28–95%, and GPU-hours used by 12–55% for workloads derived from three real-world environments. Additional experiments demonstrate that Sia scales to at least 2000-GPU clusters, provides improved fairness, and is not over-sensitive to scheduler parameter settings.

¹ In Egyptian mythology, Sia is the god of perception/intelligence [1], not to be confused with the popular music artist [2].

KEYWORDS: cluster scheduling, resource allocation, deep learning training