Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
Machine learning (ML) has become a powerful building block for modern services, scientific endeavors, and enterprise processes. The expensive computations required for training ML models often make it desirable to run them in a distributed manner in shared computing environments (e.g., Amazon EC2, Microsoft Azure, in-house shared clusters). Shared computing environments introduce a number of challenges, including uncorrelated performance jitter, heterogeneous resources, transient resources, and limited bandwidth. This dissertation demonstrates that, by structuring software frameworks and work distribution to exploit transient resources and address performance jitter and communication bandwidth limitations, we can improve the efficiency of training machine learning models.
We support this assertion with three case study systems: FlexRR, Proteus, and PipeDream. FlexRR is a distributed machine learning training system that combines a flexible synchronization model with dynamic peer-to-peer reassignment of work among workers to address stragglers caused by performance jitter. FlexRR achieves near-ideal run-time, mitigating the adverse effects of stragglers observed in shared computing environments. Proteus is an agile elastic machine learning training system that uses tiers of reliability and intelligent resource management to efficiently utilize transient compute resources. Evaluations on AWS EC2 show that Proteus reduces cost by 85% relative to non-transient pricing, and by 43% relative to previous approaches, while simultaneously reducing runtimes by up to 37%. PipeDream is a distributed training system for deep neural networks (DNNs) that partitions ranges of DNN layers among machines and aggressively pipelines computation and communication. By reducing the amount of communication, and by overlapping communication with computation, PipeDream provides a 5x or more improvement in "time to accuracy" for training large DNN models.
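To make the idea of partitioning ranges of DNN layers among machines concrete, the following is a minimal sketch, not PipeDream's actual algorithm. It assumes that a per-layer compute cost is already known and it ignores communication cost entirely. The function name `partition_layers` and its parameters are hypothetical, introduced only for illustration. The sketch splits the layers into contiguous ranges so that the most heavily loaded stage is as light as possible, since the slowest stage bounds the throughput of a pipeline.

```python
from typing import List, Tuple


def partition_layers(layer_costs: List[float], num_stages: int) -> List[Tuple[int, int]]:
    """Split layers into contiguous ranges, minimizing the cost of the slowest stage.

    Hypothetical helper for illustration only: it assumes per-layer compute
    costs are given and ignores communication between stages.
    Returns one (start, end) index pair per stage, with end exclusive.
    """
    n = len(layer_costs)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and the number of layers")

    # prefix[i] is the total cost of layers 0..i-1.
    prefix = [0.0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    inf = float("inf")
    # best[k][i]: smallest achievable bottleneck when the first i layers use k stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    # cut[k][i]: index where the last of those k stages begins.
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0

    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the ranges.
    ranges = []
    end = n
    for k in range(num_stages, 0, -1):
        start = cut[k][end]
        ranges.append((start, end))
        end = start
    ranges.reverse()
    return ranges
```

For example, with assumed costs `[4, 2, 2, 4, 4]` and two stages, the sketch assigns layers 0 to 2 to the first stage and layers 3 to 4 to the second, giving each stage a total cost of 8. A partitioner for a real system would also need to weigh the size of the activations and gradients sent across each stage boundary.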
KEYWORDS: Distributed Machine Learning, Cloud Computing, Stragglers, Elasticity, DNNs