DATE: Monday, July 25, 2011 - NOTE SPECIAL DAY
TIME: 12:00 pm - 1:00 pm
LOCATION: CIC 2101

SPEAKER: Matei Zaharia, UC Berkeley

TITLE: Spark: In-Memory Cluster Computing for Iterative and Interactive Applications

MapReduce and its variants have been highly successful in supporting data-intensive cluster applications. However, these systems are based on an acyclic data flow model that does not capture some important use cases. We present Spark, a new cluster computing framework motivated by one such class of use cases: applications that reuse a working set of data in multiple parallel operations. This includes many iterative machine learning and graph algorithms, as well as interactive data mining tools. Spark augments the data flow model with a fault-tolerant distributed memory abstraction called resilient distributed datasets (RDDs), allowing it to outperform Hadoop by up to 20x for these applications. RDDs are general enough to express MapReduce, Dryad, Pregel, and other computation models within a job. Spark also simplifies application programming by integrating into the Scala language. Finally, Spark's ability to load a dataset into memory and query it repeatedly makes it especially suitable for interactive analysis of big data. We've modified the Scala interpreter to make it possible to use Spark interactively as a highly responsive data analytics tool.

Matei Zaharia is a fourth-year graduate student at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in cloud computing, operating systems, and networking. He is also a committer on Apache Hadoop. He received his undergraduate degree from the University of Waterloo in Canada.

SDI / LCS Seminar Questions?
Karen Lindenfelser, 86716, or visit