PDL Abstract

Solving the Straggler Problem for Iterative Convergent Parallel ML

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-15-102. April 2015.

Aaron Harlap^, Henggang Cui^, Wei Dai^, Jinliang Wei^, Gregory R. Ganger^, Phillip B. Gibbons^*,
Garth A. Gibson^, Eric P. Xing^

^ Carnegie Mellon University
* Intel Labs

Parallel executions of iterative machine learning (ML) algorithms can suffer significant performance losses to stragglers. The regular (e.g., per-iteration) barriers used in the traditional bulk synchronous parallel (BSP) approach cause every transient slowdown of any worker thread to delay all others. This paper describes a scalable, efficient solution to the straggler problem for this important class of parallel ML problems, combining a more flexible synchronization model with dynamic peer-to-peer re-assignment of work among workers. Experiments with both synthetic straggler behaviors and real straggler behavior observed on Amazon EC2 confirm the significance of the problem and the effectiveness of the solution, as implemented in a framework called FlexRR. Using FlexRR, we consistently observe near-ideal run-times (relative to no performance jitter) across all straggler patterns tested.
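
To make the barrier effect concrete, the following is a minimal illustrative sketch, not FlexRR's implementation. It assumes a hypothetical table of per-iteration worker times and compares the run-time under per-iteration barriers with an idealized bound in which work is perfectly rebalanced among workers. The function names, the example timings, and the rebalancing model (perfectly divisible work, zero transfer cost) are all assumptions made for illustration.

```python
def bsp_runtime(times):
    """Total run-time with a barrier after every iteration.

    times[i][w] is the (hypothetical) time worker w needs to finish its
    share of iteration i. Each iteration ends only when the slowest
    worker finishes, so the iteration cost is the maximum over workers.
    """
    return sum(max(iteration) for iteration in times)


def rebalanced_runtime(times):
    """Idealized lower bound with perfect work re-assignment (an assumption).

    Worker w processes 1 / times[i][w] shares of work per unit time in
    iteration i. If the workers' shares could be split and moved freely
    at no cost, the iteration would finish once the combined rate has
    processed all shares: n / sum(rates).
    """
    total = 0.0
    for iteration in times:
        combined_rate = sum(1.0 / t for t in iteration)
        total += len(iteration) / combined_rate
    return total


# Hypothetical example: 4 workers, 3 iterations, 1.0 time unit per
# iteration without jitter; worker 2 is 4x slower in the second iteration.
times = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 4.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

bsp = bsp_runtime(times)                # one slow worker delays all others
rebalanced = rebalanced_runtime(times)  # idealized re-assignment bound
no_jitter = 1.0 * len(times)            # run-time with no stragglers

print(f"BSP: {bsp:.2f}  rebalanced bound: {rebalanced:.2f}  no jitter: {no_jitter:.2f}")
```

In this toy setting a single transient slowdown doubles the BSP run-time, while the idealized rebalanced bound stays close to the no-jitter run-time. A real system pays detection and transfer costs that this sketch ignores.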

KEYWORDS: Big Data infrastructure, Big Learning systems
