Big Learning (Systems for ML)
Contact: Greg Ganger, Phil Gibbons, Garth Gibson, Eric Xing, Jinliang Wei
Data analytics (a.k.a. Big Data) has emerged as a primary data processing activity for business, science, and online services that attempt to extract insight from large quantities of observational data. Increasingly, such analytics center on statistical machine learning (ML), in which an algorithm determines model parameters that make a chosen statistical model best fit the training data. Once fit (trained), such models can expose relationships among data items (e.g., for grouping documents into topics), predict outcomes for new data items based on selected characteristics (e.g., for recommendation systems), correlate effects with causes (e.g., for genomic analyses of diseases), and so on.
Growth in data sizes and desired model precision generally dictates parallel execution of ML algorithms on clusters of servers. Naturally, parallel ML involves the same work distribution, synchronization, and data consistency challenges as other parallel computations. The PDL big-learning group has attacked these challenges, creating and refining powerful new approaches for supporting large-scale ML on Big Data. Click here for a short article overviewing an inter-related collection of our efforts in this space.
For more information on some of our ongoing work, see:
- Consistency Theory: Due to the error-tolerant nature of ML training, distributed workers may execute the ML algorithm using inconsistent views of the shared model state while still producing a correct outcome. The classic Bulk Synchronous Parallel (BSP) model allows workers to compute using locally cached model state, but it still suffers from synchronization barriers. We have developed a novel consistency model, Stale Synchronous Parallel (SSP), which hides synchronization latency while providing theoretical guarantees of algorithm convergence. It has laid the foundation for many of our system designs (see the clock-rule sketch after this list).
- Exploiting Iterativeness: Because ML algorithms are iterative, their repeated access patterns can be exploited to speed up execution.
- Network Bandwidth Management: While stale local reads may still lead to a correct outcome, the error they introduce reduces per-iteration progress and lengthens training time. In this work, we are exploring the use of spare network bandwidth to reduce staleness and thus speed up training.
- Straggler Mitigation
- STRADS: Parallel ML performance depends both on how the ML algorithm is parallelized and on the performance of the underlying framework. In this study, we present a comprehensive approach that considers ML and systems concerns together: on the ML side, a Scheduled Model Parallelism (SchMP) scheme that maximizes the statistical progress of parallel ML algorithms, and on the system side, the STRADS framework that maximizes the throughput of SchMP programs in a distributed environment (see the scheduling sketch after this list).
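To make the bounded-staleness idea concrete, here is a minimal, illustrative Python sketch of the SSP clock rule (names and structure are hypothetical, not the actual parameter server API): a worker that finishes an iteration may run ahead only while the slowest worker is within a fixed staleness bound; setting the bound to zero recovers BSP's barrier.

```python
import threading

class SspClock:
    """Illustrative sketch of the Stale Synchronous Parallel (SSP) clock rule.

    A worker may be at most `staleness` clocks ahead of the slowest worker;
    otherwise it blocks. staleness=0 degenerates to a BSP-style barrier.
    """

    def __init__(self, num_workers, staleness):
        self.staleness = staleness
        self.clocks = [0] * num_workers   # per-worker iteration counters
        self.cond = threading.Condition()

    def advance(self, worker_id):
        """Called when a worker finishes one iteration."""
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()        # wake workers waiting on slower peers
            # Block while this worker is more than `staleness` clocks ahead
            # of the slowest worker; the slowest worker never blocks here.
            while self.clocks[worker_id] > min(self.clocks) + self.staleness:
                self.cond.wait()
```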
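Similarly, the following rough sketch illustrates the scheduling idea behind SchMP: group model parameters into rounds so that parameters updated in parallel do not conflict with one another. The `dependency` predicate is an assumed, application-supplied stand-in; the real STRADS scheduler additionally prioritizes parameters that are far from converged.

```python
import random

def schedule_rounds(params, dependency, num_workers):
    """Hypothetical sketch of scheduled model parallelism (SchMP).

    Yields rounds of up to `num_workers` parameters such that no two
    parameters in the same round depend on each other, so they can be
    updated in parallel without hurting statistical progress.
    """
    remaining = list(params)
    random.shuffle(remaining)
    while remaining:
        batch, skipped = [], []
        for p in remaining:
            if len(batch) < num_workers and all(not dependency(p, q) for q in batch):
                batch.append(p)    # independent of everything already scheduled
            else:
                skipped.append(p)  # defer conflicting parameters to a later round
        yield batch                # one round of parallel parameter updates
        remaining = skipped
```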
People
FACULTY
Greg Ganger
Phil Gibbons
Garth Gibson
Eric Xing
GRAD STUDENTS
Jin Kyu Kim
Henggang Cui
Wei Dai
Jinliang Wei
Aaron Harlap
ALUMNI
Partners
Publications
- Tributary: Spot-dancing for elastic services with latency SLOs. Aaron Harlap, Andrew Chung, Alexey Tumanov, Gregory R. Ganger, Phillip B. Gibbons. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-18-102, Jan. 2018.
Abstract / PDF [990K]
- Aging Gracefully with Geriatrix: A File System Aging Tool. Saurabh Kadekodi, Vaishnavh Nagarajan, Garth A. Gibson. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-17-106, October 2017. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-16-105, October 2016.
Abstract / PDF [560K]
- Litz: An Elastic Framework for High-Performance Distributed Machine Learning. Aurick Qiao, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A. Gibson, Eric P. Xing. Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-17-103, June 2017.
Abstract / PDF [424K]
- Proteus: Agile ML Elasticity through Tiered Reliability in Dynamic Resource Markets. Aaron Harlap, Alexey Tumanov, Andrew Chung, Greg Ganger, Phil Gibbons. ACM European Conference on Computer Systems, 2017 (EuroSys'17), 23rd-26th April, 2017, Belgrade, Serbia. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-16-102, May 2016.
Abstract / PDF [743K]
- Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds. Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu. 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 27–29, 2017, Boston, MA.
Abstract / PDF [1.5M]
- MLtuner: System Support for Automatic Machine Learning Tuning. Henggang Cui, Gregory R. Ganger, and Phillip B. Gibbons. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-16-108, October 2016.
Abstract / PDF [900K]
- Benchmarking Apache Spark with Machine Learning Applications. Jinliang Wei, Jin Kyu Kim, Garth A. Gibson. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-16-107, October 2016.
Abstract / PDF [360K]
- Addressing the Straggler Problem for Iterative Convergent Parallel ML. Aaron Harlap, Henggang Cui, Wei Dai, Jinliang Wei, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing. ACM Symposium on Cloud Computing 2016. Oct 5-7, Santa Clara, CA. Supersedes Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-15-102, April 2015.
Abstract / PDF [519K]
- GeePS: Scalable Deep Learning on Distributed GPUs with a GPU-Specialized Parameter Server. Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, and Eric P. Xing. ACM European Conference on Computer Systems, 2016 (EuroSys'16), 18th-21st April, 2016, London, UK.
Abstract / PDF [617K]
- STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning. Jin Kyu Kim, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth A. Gibson, Eric P. Xing. ACM European Conference on Computer Systems, 2016 (EuroSys'16), 18th-21st April, 2016, London, UK.
Abstract / PDF [1.6M]
- Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics. Jinliang Wei, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing. ACM Symposium on Cloud Computing 2015. Aug. 27 - 29, 2015, Kohala Coast, HI.
Abstract / PDF [369K]
- Using Data Transformations for Low-latency Time Series Analysis. Henggang Cui, Kimberly Keeton, Indrajit Roy, Krishnamurthy Viswanathan, Gregory R. Ganger. ACM Symposium on Cloud Computing 2015. Aug. 27 - 29, 2015, Kohala Coast, HI. See the extended Technical Report for more information.
Abstract / PDF [1.3M]
- Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics. Jinliang Wei, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-15-105, April 2015.
Abstract / PDF [2.62M]
- Solving the Straggler Problem for Iterative Convergent Parallel ML. Aaron Harlap, Henggang Cui, Wei Dai, Jinliang Wei, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing. Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-15-102, April 2015.
Abstract / PDF [519K]
- High-Performance Distributed ML at Scale through Parameter Server Consistency Models. Wei Dai, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth Gibson, Eric P. Xing. 29th AAAI Conf. on Artificial Intelligence (AAAI-15), Jan 25-29, 2015, Austin, Texas.
Abstract / PDF [733K]
- On Model Parallelization and Scheduling Strategies for Distributed Machine Learning. S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, E. P. Xing. Proceedings of 2014 Neural Information Processing Systems (NIPS’14), December 2014.
Abstract / PDF [336K]
- Exploiting Iterative-ness for Parallel ML Computations. Henggang Cui, Alexey Tumanov, Jinliang Wei, Lianghong Xu, Wei Dai, Jesse Haber-Kucharsky, Qirong Ho, Greg R. Ganger, Phil B. Gibbons, Garth A. Gibson, Eric P. Xing. ACM Symposium on Cloud Computing 2014 (SoCC'14), Seattle, WA, Nov 2014. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-107.
Abstract / PDF [609K]
- Exploiting Bounded Staleness to Speed up Big Data Analytics. Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing. 2014 USENIX Annual Technical Conference (ATC'14). June 19-20, 2014. Philadelphia, PA. Supersedes CMU-PDL-14-101.
Abstract / PDF [731K]
- More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. Qirong Ho, James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak Lee, Phillip B. Gibbons, Garth A. Gibson, Gregory R. Ganger, Eric P. Xing. Conference on Neural Information Processing Systems (NIPS '13). Dec 5-8, 2013, Lake Tahoe, NV.
Abstract / PDF [2.64M] / Appendix
- Solving the Straggler Problem with Bounded Staleness. James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, Eric Xing. 14th USENIX HotOS Workshop, Santa Ana Pueblo, NM, May 13-15, 2013.
Abstract / PDF [174K]
Presentations
- A New Look at the System, Algorithm and Theory Foundations of Distributed Machine Learning
Eric Xing, Qirong Ho (KDD 2015 tutorial)
PDF [12MB]
Acknowledgements
We thank the members and companies of the PDL Consortium: Broadcom, Ltd., Dell EMC, Facebook, Google, Hewlett-Packard Labs, Hitachi Ltd., Intel Corporation, IBM, Microsoft Research, MongoDB, NetApp, Inc., Oracle Corporation, Salesforce, Samsung Information Systems America, Seagate Technology, Toshiba, Two Sigma, Veritas and Western Digital for their interest, insights, feedback, and support.