DATE: Monday, September 19, 2016
TIME: 12:00 pm - 1:00 pm
PLACE: GHC 8102

SPEAKER: Joseph Bradley, Databricks

TITLE: Foundations for Scaling Analytics in Apache Spark

ABSTRACT:
We will give an overview of current and future work on building foundations for scaling machine learning and graph processing in Apache Spark.

Apache Spark is the most active open source Big Data project, with 1000+ contributors. The ability to scale is a key benefit of Spark: the same code should run on a laptop or on hundreds to thousands of machines. Another big attraction is the integration of analytics libraries for machine learning (ML) and graph processing.

This talk will cover the juncture between the low-level (scaling) and high-level (analytics) components of Spark. The most important change for ML and graphs on Spark in the past year has been a migration of analytics libraries to use Spark DataFrames instead of RDDs. This ongoing migration is laying the groundwork for future speedups and scaling. In addition to API impacts, we will discuss the integration of analytics with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management and code generation.

BIO:
Joseph Bradley is an Apache Spark committer and PMC member, working as a Software Engineer at Databricks. He focuses on Spark MLlib, GraphFrames, and other advanced analytics on Spark. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
http://spark.apache.org/
https://databricks.com/
http://www.cs.cmu.edu/~jkbradle/

VISITOR HOST: Majd Sakr

SDI / ISTC SEMINAR QUESTIONS?
Karen Lindenfelser, 86716, or visit www.pdl.cmu.edu/SDI/