
PDL Abstract

Benchmarking Apache Spark with Machine Learning Applications

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-16-102. May 2016.

Jinliang Wei, Jin Kyu Kim, Garth A. Gibson

Carnegie Mellon University

http://www.pdl.cmu.edu/

We benchmarked Apache Spark with a popular parallel machine learning training application, Distributed Stochastic Gradient Descent (DSGD) for Matrix Factorization [5], and compared the Spark implementation with alternative approaches for communicating model parameters, such as scheduled pipelining using POSIX sockets or MPI, and distributed shared memory (e.g., a parameter server [13]). We found that Spark incurs substantial overhead even with a modest model size (rank of a few hundred). For example, the PySpark implementation using a single single-core executor was about 3X slower than a serial out-of-core Python implementation and 226X slower than a serial C++ implementation. With a modest dataset (the Netflix dataset, containing 100 million ratings), the PySpark implementation achieved a 5.5X speedup going from 1 to 8 machines, using 1 core per machine, but it failed to achieve further speedup with more machines or to gain speedup from using more cores per machine. While our investigation is still ongoing, we believe that shuffling, as Spark's only scalable mechanism for propagating model updates, is responsible for much of this overhead, and that more efficient communication approaches could lead to much better performance.
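To make the communication pattern concrete, the following is a minimal, illustrative PySpark sketch of block-stratified DSGD matrix factorization in the spirit of [5]; it is not the benchmarked implementation. The rank, step size, regularization, block count, and toy ratings are assumed values, and the driver-side broadcast/collect of factor blocks stands in for the per-sub-epoch communication that a scalable Spark implementation would perform.

# Minimal, illustrative PySpark sketch of block-stratified DSGD matrix
# factorization (in the spirit of [5]); NOT the benchmarked implementation.
# RANK, ETA, REG, B, and the toy ratings are assumed values for illustration.
import numpy as np
from pyspark import SparkContext

RANK, ETA, REG, B = 32, 0.01, 0.05, 4           # assumed hyperparameters
sc = SparkContext(appName="dsgd-mf-sketch")

# Toy ratings (user, item, rating); a real run would load e.g. the Netflix data.
ratings = [(u, i, float((u * i) % 5 + 1)) for u in range(64) for i in range(0, 48, 3)]
n_users = 1 + max(r[0] for r in ratings)
n_items = 1 + max(r[1] for r in ratings)

# Group ratings into B x B blocks; one task later owns one block.
blocks = (sc.parallelize(ratings)
            .map(lambda t: ((t[0] % B, t[1] % B), [t]))
            .reduceByKey(lambda a, b: a + b)
            .cache())

rng = np.random.default_rng(0)
# Factor blocks, keyed by block id, each mapping a row/column index to its factor vector.
W = {b: {u: 0.1 * rng.standard_normal(RANK) for u in range(n_users) if u % B == b}
     for b in range(B)}
H = {b: {i: 0.1 * rng.standard_normal(RANK) for i in range(n_items) if i % B == b}
     for b in range(B)}

def sgd_block(block_ratings, Wb, Hb):
    """Plain SGD over one rating block; returns updated copies of the factor blocks."""
    Wb, Hb = dict(Wb), dict(Hb)                 # do not mutate broadcast values
    for (u, i, r) in block_ratings:
        err = r - Wb[u].dot(Hb[i])
        w, h = Wb[u], Hb[i]
        Wb[u] = w + ETA * (err * h - REG * w)
        Hb[i] = h + ETA * (err * Wb[u] - REG * h)
    return Wb, Hb

for epoch in range(3):
    for s in range(B):                          # one stratum of non-conflicting blocks per sub-epoch
        W_bc, H_bc = sc.broadcast(W), sc.broadcast(H)
        # User block b is paired with item block (b + s) % B in this sub-epoch.
        stratum = blocks.filter(lambda kv, s=s: kv[0][1] == (kv[0][0] + s) % B)
        updated = (stratum
                   .map(lambda kv: (kv[0],
                                    sgd_block(kv[1],
                                              W_bc.value[kv[0][0]],
                                              H_bc.value[kv[0][1]])))
                   .collect())
        # Fold the updated factor blocks back into the driver's copy of the model.
        for (ub, ib), (Wb, Hb) in updated:
            W[ub], H[ib] = Wb, Hb

In a scalable Spark implementation the factor blocks would themselves live in RDDs that are re-keyed and joined with the rating blocks every sub-epoch; that repeated shuffle of model state is the communication cost the abstract points to, and it is exactly what the alternative approaches (scheduled pipelining, parameter server) replace with cheaper point-to-point or shared-memory updates.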

KEYWORDS: distributed systems, machine learning, Apache Spark
