PDL Abstract

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 27–29, 2017, Boston, MA.

Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger,
Phillip B. Gibbons, Onur Mutlu

Carnegie Mellon University

Machine learning (ML) is widely used to derive useful information from large-scale data (such as user activities, pictures, and videos) generated at increasingly rapid rates all over the world. It is infeasible to move all of this globally generated data to a centralized data center before running an ML algorithm over it: moving large amounts of raw data over a wide-area network (WAN) can be extremely slow, and is also subject to the constraints of national privacy laws. This motivates the need for a geo-distributed ML system spanning multiple data centers. Unfortunately, communicating over WANs can significantly degrade ML system performance (by as much as 53.7X in our study) because the communication overwhelms the limited WAN bandwidth.

Our goal in this work is to develop a geo-distributed ML system that (1) employs an intelligent communication mechanism over WANs to efficiently utilize the scarce WAN bandwidth, while retaining the accuracy and correctness guarantees of an ML algorithm; and (2) is generic and flexible enough to run a wide range of ML algorithms, without requiring any changes to the algorithms.

To this end, we introduce a new, general geo-distributed ML system, Gaia, that decouples the communication within a data center from the communication between data centers, enabling different communication and consistency models for each. We present a new ML synchronization model, Approximate Synchronous Parallel (ASP), which dynamically adjusts to the available WAN bandwidth between data centers and eliminates the vast majority of insignificant communication while still guaranteeing the correctness of ML algorithms. Our experiments on our prototypes of Gaia, running across 11 Amazon EC2 global regions and on a cluster that emulates EC2 WAN bandwidth, show that Gaia provides a 1.8-53.5X speedup over a state-of-the-art distributed ML system, and is within 0.94-1.56X of ML-on-LAN speeds.
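
To make the idea of eliminating insignificant communication concrete, the following is a minimal sketch and not Gaia's actual implementation. It assumes that "insignificant" means an accumulated update that is small relative to the current parameter value, and it uses a fixed cutoff, whereas the abstract describes ASP as adjusting dynamically to the available WAN bandwidth. The class name `SignificanceFilter`, its fields, and the 1% default threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SignificanceFilter:
    """Hypothetical per-data-center filter (an illustration, not Gaia's code).

    Updates are always applied locally; an aggregated update is released
    for sending over the WAN only once it is large relative to the value.
    """
    threshold: float = 0.01                      # assumed relative cutoff (1%)
    values: dict = field(default_factory=dict)   # local parameter values
    pending: dict = field(default_factory=dict)  # accumulated unsent updates

    def update(self, key, delta):
        """Apply delta locally; return the update to send, or None."""
        self.values[key] = self.values.get(key, 0.0) + delta
        self.pending[key] = self.pending.get(key, 0.0) + delta
        current = self.values[key]
        accumulated = self.pending[key]
        # Relative significance, guarding against division by zero.
        if current == 0.0:
            significant = accumulated != 0.0
        else:
            significant = abs(accumulated / current) >= self.threshold
        if significant:
            self.pending[key] = 0.0   # the aggregate is handed to the sender
            return accumulated
        return None                   # keep accumulating; nothing crosses the WAN


# Usage: a large update is sent at once, a tiny one is held back and then
# sent together with a later update once the aggregate becomes significant.
wan_filter = SignificanceFilter(threshold=0.01)
first = wan_filter.update("w", 1.0)     # significant: returned for sending
second = wan_filter.update("w", 0.001)  # about 0.1% of the value: withheld
third = wan_filter.update("w", 0.02)    # aggregate of about 0.021 is released
```

In this sketch the withheld updates are never lost: they remain in `pending` and travel with the next significant aggregate, which is one simple way to reduce WAN traffic without discarding information.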