Big Learning (Systems for ML)
Contact: Greg Ganger, Phil Gibbons, Garth Gibson, Eric Xing, Jinliang Wei
Data analytics (a.k.a. Big Data) has emerged as a primary data processing activity for business, science, and online services that attempt to extract insight from large quantities of observational data. Increasingly, such analytics center on statistical machine learning (ML), in which an algorithm determines model parameters that make a chosen statistical model best fit the training data. Once fit (trained), such models can expose relationships among data items (e.g., for grouping documents into topics), predict outcomes for new data items based on selected characteristics (e.g., for recommendation systems), correlate effects with causes (e.g., for genomic analyses of diseases), and so on.
Growth in data sizes and desired model precision generally dictates parallel execution of ML algorithms on clusters of servers. Naturally, parallel ML involves the same work distribution, synchronization, and data consistency challenges as other parallel computations. The PDL big-learning group has attacked these challenges, creating and refining powerful new approaches for supporting large-scale ML on Big Data. Click here for a short article overviewing an inter-related collection of our efforts in this space.
For more information on some of our ongoing work, see:
- Consistency Theory: Due to the error-tolerant nature of ML training, distributed workers may execute the ML algorithm using inconsistent views of the shared model state while still producing a correct outcome. The classic Bulk Synchronous Parallel (BSP) model allows workers to compute using locally cached model state, but it still suffers from synchronization barriers. We have developed a novel consistency model, Stale Synchronous Parallel (SSP), which hides synchronization latency while providing theoretical guarantees of algorithm convergence. It has laid the foundation for many of our system designs (see the clock-rule sketch after this list).
- Exploiting Iterativeness: Because ML algorithms are iterative, their repeated access patterns can be exploited to speed up execution.
- Network Bandwidth Management: While stale local reads may still lead to a correct outcome, the error they introduce reduces per-iteration progress and lengthens training time. In this work, we are exploring the use of spare network bandwidth to reduce staleness and thus speed up training.
- Straggler Mitigation
- STRADS: Parallel ML performance depends both on how the ML algorithm is parallelized and on the performance of the underlying framework. In this study, we present a comprehensive approach that considers ML and systems concerns together: on the ML side, a Scheduled Model Parallelism (SchMP) scheme that maximizes the statistical progress of parallel ML algorithms, and on the system side, the STRADS framework that maximizes the throughput of SchMP programs in a distributed environment (see the scheduling sketch after this list).
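To make the bounded-staleness idea concrete, here is a minimal, illustrative Python sketch of the SSP clock rule (names and structure are hypothetical, not the actual parameter server API): a worker that finishes an iteration may run ahead only while the slowest worker is within a fixed staleness bound; setting the bound to zero recovers BSP's barrier.

```python
import threading

class SspClock:
    """Illustrative sketch of the Stale Synchronous Parallel (SSP) clock rule.

    A worker may be at most `staleness` clocks ahead of the slowest worker;
    otherwise it blocks. staleness=0 degenerates to a BSP-style barrier.
    """

    def __init__(self, num_workers, staleness):
        self.staleness = staleness
        self.clocks = [0] * num_workers   # per-worker iteration counters
        self.cond = threading.Condition()

    def advance(self, worker_id):
        """Called when a worker finishes one iteration."""
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()        # wake workers waiting on slower peers
            # Block while this worker is more than `staleness` clocks ahead
            # of the slowest worker; the slowest worker never blocks here.
            while self.clocks[worker_id] > min(self.clocks) + self.staleness:
                self.cond.wait()
```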
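Similarly, the following rough sketch illustrates the scheduling idea behind SchMP: group model parameters into rounds so that parameters updated in parallel do not conflict with one another. The `dependency` predicate is an assumed, application-supplied stand-in; the real STRADS scheduler additionally prioritizes parameters that are far from converged.

```python
import random

def schedule_rounds(params, dependency, num_workers):
    """Hypothetical sketch of scheduled model parallelism (SchMP).

    Yields rounds of up to `num_workers` parameters such that no two
    parameters in the same round depend on each other, so they can be
    updated in parallel without hurting statistical progress.
    """
    remaining = list(params)
    random.shuffle(remaining)
    while remaining:
        batch, skipped = [], []
        for p in remaining:
            if len(batch) < num_workers and all(not dependency(p, q) for q in batch):
                batch.append(p)    # independent of everything already scheduled
            else:
                skipped.append(p)  # defer conflicting parameters to a later round
        yield batch                # one round of parallel parameter updates
        remaining = skipped
```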
People
FACULTY
Greg Ganger
Phil Gibbons
Garth Gibson
Eric Xing
GRAD STUDENTS
Jin Kyu Kim
Henggang Cui
Wei Dai
Jinliang Wei
Aaron Harlap
ALUMNI
Partners
Publications
- Tributary: Spot-dancing for elastic services with latency SLOs. Aaron Harlap, Andrew Chung, Alexey Tumanov, Gregory R. Ganger, Phillip B. Gibbons. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-18-102, Jan. 2018.
Abstract / PDF [990K]
- Aging Gracefully with Geriatrix: A File System Aging Tool. Saurabh Kadekodi, Vaishnavh Nagarajan, Garth A. Gibson. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-17-106, October 2017. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-16-105, October 2016.
Abstract / PDF [560K]
- Litz: An Elastic Framework for High-Performance Distributed Machine Learning. Aurick Qiao, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A. Gibson, Eric P. Xing. Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-17-103, June 2017.
Abstract / PDF [424K]
- Proteus: Agile ML Elasticity through Tiered Reliability in Dynamic Resource Markets. Aaron Harlap, Alexey Tumanov, Andrew Chung, Greg Ganger, Phil Gibbons. ACM European Conference on Computer Systems, 2017 (EuroSys'17), 23rd-26th April, 2017, Belgrade, Serbia. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-16-102, May 2016.
Abstract / PDF [743K]
- Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds. Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu. 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 27–29, 2017, Boston, MA.
Abstract / PDF [1.5M]
- MLtuner: System Support for Automatic Machine Learning Tuning. Henggang Cui, Gregory R. Ganger, and Phillip B. Gibbons. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-16-108, October 2016.
Abstract / PDF [900K]
- Benchmarking Apache Spark with Machine Learning Applications. Jinliang Wei, Jin Kyu Kim, Garth A. Gibson. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-16-107, October 2016.
Abstract / PDF [360K]
- Addressing the Straggler Problem for Iterative Convergent Parallel ML. Aaron Harlap, Henggang Cui, Wei Dai, Jinliang Wei, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing. ACM Symposium on Cloud Computing 2016. Oct 5-7, Santa Clara, CA. Supersedes Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-15-102, April 2015.
Abstract / PDF [519K]
- GeePS: Scalable Deep Learning on Distributed GPUs with a GPU-Specialized Parameter Server. Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, and Eric P. Xing. ACM European Conference on Computer Systems, 2016 (EuroSys'16), 18th-21st April, 2016, London, UK.
Abstract / PDF [617K]
- STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning. Jin Kyu Kim, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth A. Gibson, Eric P. Xing. ACM European Conference on Computer Systems, 2016 (EuroSys'16), 18th-21st April, 2016, London, UK.
Abstract / PDF [1.6M]
- Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics. Jinliang Wei, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing. ACM Symposium on Cloud Computing 2015. Aug. 27 - 29, 2015, Kohala Coast, HI.
Abstract / PDF [369K]
- Using Data Transformations for Low-latency Time Series Analysis. Henggang Cui, Kimberly Keeton, Indrajit Roy, Krishnamurthy Viswanathan, Gregory R. Ganger. ACM Symposium on Cloud Computing 2015. Aug. 27 - 29, 2015, Kohala Coast, HI. See the extended Technical Report for more information.
Abstract / PDF [1.3M]
- Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics. Jinliang Wei, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-15-105, April 2015.
Abstract / PDF [2.62M]
- Solving the Straggler Problem for Iterative Convergent Parallel ML. Aaron Harlap, Henggang Cui, Wei Dai, Jinliang Wei, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing. Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-15-102, April 2015.
Abstract / PDF [519K]
- High-Performance Distributed ML at Scale through Parameter Server Consistency Models. Wei Dai, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth Gibson, Eric P. Xing. 29th AAAI Conf. on Artificial Intelligence (AAAI-15), Jan 25-29, 2015, Austin, Texas.
Abstract / PDF [733K]
- On Model Parallelization and Scheduling Strategies for Distributed Machine Learning. S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, E. P. Xing. Proceedings of 2014 Neural Information Processing Systems (NIPS’14), December 2014.
Abstract / PDF [336K]
- Exploiting Iterative-ness for Parallel ML Computations. Henggang Cui, Alexey Tumanov, Jinliang Wei, Lianghong Xu, Wei Dai, Jesse Haber-Kucharsky, Qirong Ho, Greg R. Ganger, Phil B. Gibbons, Garth A. Gibson, Eric P. Xing. ACM Symposium on Cloud Computing 2014 (SoCC'14), Seattle, WA, Nov 2014. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-107.
Abstract / PDF [609K]
- Exploiting Bounded Staleness to Speed up Big Data Analytics. Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing. 2014 USENIX Annual Technical Conference (ATC'14). June 19-20, 2014. Philadelphia, PA. Supersedes CMU-PDL-14-101.
Abstract / PDF [731K]
- More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. Qirong Ho, James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak Lee, Phillip B. Gibbons, Garth A. Gibson, Gregory R. Ganger, Eric P. Xing. Conference on Neural Information Processing Systems (NIPS '13). Dec 5-8, 2013, Lake Tahoe, NV.
Abstract / PDF [2.64M] / Appendix
- Solving the Straggler Problem with Bounded Staleness. James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, Eric Xing. 14th USENIX HotOS Workshop, Santa Ana Pueblo, NM, May 13-15, 2013.
Abstract / PDF [174K]
Presentations
- A New Look at the System, Algorithm and Theory Foundations of Distributed Machine Learning
Eric Xing, Qirong Ho (KDD 2015 tutorial)
PDF [12MB]
Acknowledgements
We thank the members and companies of the PDL Consortium: Broadcom, Ltd., Dell EMC, Facebook, Google, Hewlett-Packard Labs, Hitachi Ltd., Intel Corporation, IBM, Microsoft Research, MongoDB, NetApp, Inc., Oracle Corporation, Salesforce, Samsung Information Systems America, Seagate Technology, Toshiba, Two Sigma, Veritas and Western Digital for their interest, insights, feedback, and support.