Parallel Data Laboratory

DISC: Data-Intensive Super Computing

The leading Internet search providers have created a new class of large-scale computer systems to support their businesses. We are formulating a plan for a research project that extends the type of computing systems used for Internet search to a larger range of applications. We refer to such systems as "Data-Intensive Super Computing" (DISC) systems. DISC systems differ from conventional supercomputers in their focus on data: they acquire and maintain continually changing data sets, in addition to performing large-scale computations over the data. With the massive amounts of data arising from such diverse sources as telescope imagery, numerical simulations, medical records, online transaction records, and web pages, DISC systems have the potential to achieve major advances in science, health care, business efficiency, and information access. DISC opens up many important research topics in system design, resource management, programming models, parallel algorithms, and applications. By engaging the academic research community in these issues, we can explore fundamental aspects of a societally important style of computing more systematically and in a more open forum.

Applications

Web search without language barriers
Inferring biological function from genomic sequences
Predicting and modeling the effects of earthquakes
Discovering new astronomical phenomena from telescope imagery data
Synthesizing realistic graphic animations
Understanding the spatial and temporal patterns of brain behavior based on MRI data

Research Areas

Programming models for DISC systems
Methodologies and tools for supporting software development in DISC systems
Runtime software support for DISC systems
Resource management and sharing
Hardware and processor design for DISC systems

Challenges

How should the processors be designed for use in cluster machines?
How can we effectively support different scientific communities in their data management and applications?
Can we radically reduce the energy requirements for large-scale systems?
How do we build large-scale computing systems with an appropriate balance of performance and cost?
How can very large systems be constructed given the realities of component failures and repair times?
Can we support a mix of long-running data-intensive jobs with ones requiring interactive response?
How do we control access to the system while enabling sharing?
Can we deal with bad or unavailable data in a systematic way?
Can high-performance systems be built from heterogeneous components?

News

Yahoo! press releases:

Yahoo! Reaches for the Stars with M45 Supercomputing Project
Yahoo! Launches New Program to Advance Open-Source Software for Internet Computing
Carnegie Mellon University First to Take Advantage of Yahoo!'s Large-Scale Hardware and Software Investments for the Open Source Community

Associated Projects

Problem Diagnosis in Distributed Systems
Large-scale scene matching
Large scale graph mining
REAP databases
Grammar Induction
Improving Article, User, and Group Effectiveness in Wikipedia

People

FACULTY

Randy Bryant
Greg Ganger
Garth Gibson
Julio López
David O'Hallaron

GRADUATE STUDENTS

Wittawat Tantisiriroj

EXTERNAL COLLABORATORS

Gary Grider (LANL)
James Nunez (LANL)
Jay Kistler (Yahoo!)
Chris Olston (Yahoo!)

Publications

Applying Idealized Lower-bound Runtime Models to Understand Inefficiencies in Data-intensive Computing (Extended Abstract). Elie Krevat, Tomer Shiran, Eric Anderson, Joseph Tucek, Jay J. Wylie, Gregory R. Ganger: SIGMETRICS 2011: 125-126, San Jose, CA, June 7-11, 2011.
Applying Performance Models to Understand Data-intensive Computing Efficiency. Elie Krevat, Tomer Shiran, Eric Anderson, Joseph Tucek, Jay J. Wylie, Gregory R. Ganger. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-108. May 2010.
Understanding and Maturing the Data-Intensive Scalable Computing Storage Substrate. Garth Gibson, Bin Fan, Swapnil Patil, Milo Polte, Wittawat Tantisiriroj, Lin Xiao. Microsoft Research eScience Workshop 2009, Pittsburgh, PA, October 16-17, 2009.
Data-Intensive Supercomputing: The Case for DISC. Randal E. Bryant. Carnegie Mellon University School of Computer Science Tech Report CMU-CS-07-128. May 10, 2007.
Data-Intensive Supercomputing: Presentation to the 2007 Federated Computing Research Conference (FCRC)

Presentations

Improving Storage Services in the Cloud. Garth Gibson. 3rd Open Cirrus Summit, Seoul, Korea, June 8-9, 2010.
Quicktime MOV [250MB, ~15 min]

Acknowledgements

We thank the members and companies of the PDL Consortium: Amazon, Google, Hitachi Ltd., Honda, Intel Corporation, IBM, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.