NOTE SPECIAL DAY - MONDAY
Monday, September 19, 2016
SPEAKER: Joseph Bradley, Databricks
TITLE: Foundations for Scaling Analytics in Apache Spark
Apache Spark is the most active open source Big Data project, with 1000+ contributors. The ability to scale is a key benefit of Spark: the same code should run on a laptop or 100's to 1000's of machines. Another big attraction is integration of analytics libraries for machine learning (ML) and graph processing.
This talk will cover the juncture between the low-level (scaling) and high-level (analytics) components of Spark. The most important change for ML and graphs on Spark in the past year has been a migration of analytics libraries to use Spark DataFrames instead of RDDs. This ongoing migration is laying the groundwork for future speedups and scaling. In addition to API impacts, we will discuss the integration of analytics with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management and code generation. [slides]
VISITOR HOSTS: Majd Sakr