Litz: An Elastic Framework for High-Performance Distributed Machine Learning

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-17-103, June 2017.

Aurick Qiao*^, Abutalib Aghayev^, Weiren Yu*, Haoyang Chen*, Qirong Ho*, Garth A. Gibson^,
Eric P. Xing*^

*Petuum, Inc.
^Carnegie Mellon University


Machine Learning (ML) is becoming an increasingly popular application in the cloud and data centers, inspiring a growing number of distributed frameworks optimized for it. These frameworks leverage the specific properties of ML algorithms to achieve orders of magnitude performance improvements over generic data processing frameworks like Hadoop or Spark. However, they also tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of the multi-tenant environments in which they run. Furthermore, the programming models provided by these frameworks tend to be restrictive, narrowing their applicability even within the sphere of ML workloads.

Motivated by these trends, we present Litz, a distributed ML framework that achieves both elasticity and generality without giving up the performance of more specialized frameworks. Litz uses a programming model based on scheduling micro-tasks with parameter server access, which enables applications to implement key distributed ML techniques that have recently been introduced. Furthermore, we believe that the union of ML and elasticity presents new opportunities for job scheduling due to the dynamic resource usage of ML algorithms. We give examples of ML properties that give rise to such resource usage patterns and suggest ways to exploit them to improve resource utilization in multi-tenant environments.

To evaluate Litz, we implement two popular ML applications that vary dramatically in terms of their structure and run-time behavior; they are typically implemented by different ML frameworks tuned for each. We show that Litz achieves competitive performance with the state of the art while providing low-overhead elasticity and exposing the underlying dynamic resource usage of ML applications.

KEYWORDS: distributed systems, machine learning, parameter server, elasticity





© 2018. Last updated 26 June, 2017