Data-Intensive Super Computing (DISC)


    Contact: Julio López, Garth Gibson

    The leading Internet search providers have created a new class of large-scale computer systems to support their businesses. We are formulating a plan for a research project that extends the type of computing systems used for Internet search to a broader range of applications. We refer to such systems as "Data-Intensive Super Computing" (DISC) systems. DISC systems differ from conventional supercomputers in their focus on data: they acquire and maintain continually changing data sets, in addition to performing large-scale computations over the data. With the massive amounts of data arising from such diverse sources as telescope imagery, numerical simulations, medical records, online transaction records, and web pages, DISC systems have the potential to achieve major advances in science, health care, business efficiencies, and information access. DISC opens up many important research topics in system design, resource management, programming models, parallel algorithms, and applications. By engaging the academic research community in these issues, we can explore fundamental aspects of this societally important style of computing more systematically and in a more open forum.
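
    To make the programming-model aspect concrete, the sketch below shows the kind of data-parallel, MapReduce-style computation popularized by the Internet search providers and commonly associated with DISC-class systems. It is a minimal, illustrative word-count example in Python; the function names and structure are our own and do not represent any particular DISC framework or API.

        # Minimal sketch of a MapReduce-style data-parallel computation
        # (illustrative only; not part of any DISC system's software).
        from collections import defaultdict
        from itertools import chain

        def map_phase(document):
            """Map: emit a (word, 1) pair for every word in one document."""
            return [(word, 1) for word in document.split()]

        def shuffle(pairs):
            """Shuffle: group intermediate values by key."""
            groups = defaultdict(list)
            for key, value in pairs:
                groups[key].append(value)
            return groups

        def reduce_phase(groups):
            """Reduce: combine the values for each key (here, sum the counts)."""
            return {key: sum(values) for key, values in groups.items()}

        if __name__ == "__main__":
            documents = [
                "data intensive super computing",
                "computing over continually changing data",
            ]
            intermediate = chain.from_iterable(map_phase(d) for d in documents)
            counts = reduce_phase(shuffle(intermediate))
            print(counts)  # e.g. {'data': 2, 'computing': 2, ...}

    In a real DISC system the map and reduce steps run in parallel across thousands of nodes, with the runtime handling data partitioning, scheduling, and recovery from component failures.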

    Applications

    • Web search without language barriers
    • Inferring biological function from genomic sequences
    • Predicting and modeling the effects of earthquakes
    • Discovering new astronomical phenomena from telescope imagery data
    • Synthesizing realistic graphic animations
    • Understanding the spatial and temporal patterns of brain behavior based on MRI data

    Research Areas

    • Programming models for DISC systems
    • Methodologies and tools for supporting software development in DISC systems
    • Runtime software support for DISC systems
    • Resource management and sharing
    • Hardware and processor design for DISC systems

    Challenges

    • How should the processors be designed for use in cluster machines?
    • How can we effectively support different scientific communities in their data management and applications?
    • Can we radically reduce the energy requirements for large-scale systems?
    • How do we build large-scale computing systems with an appropriate balance of performance and cost?
    • How can very large systems be constructed given the realities of component failures and repair times?
    • Can we support a mix of long-running data-intensive jobs with ones requiring interactive response?
    • How do we control access to the system while enabling sharing?
    • Can we deal with bad or unavailable data in a systematic way?
    • Can high-performance systems be built from heterogeneous components?

    News

    Yahoo! press releases:

    Associated Projects

    People

    FACULTY

    GRADUATE STUDENTS

    EXTERNAL COLLABORATORS

    • Steve Schlosser (Intel)
    • Gary Grider (LANL)
    • James Nunez (LANL)
    • Jay Kistler (Yahoo!)
    • Chris Olston (Yahoo!)

    Publications and Presentations

    • Data-Intensive Supercomputing: The Case for DISC. Randal E. Bryant. Carnegie Mellon University School of Computer Science Technical Report CMU-CS-07-128, May 10, 2007. (PDF)
    • Data-Intensive Supercomputing. Presentation to the 2007 Federated Computing Research Conference (FCRC). (Original and revised versions available.)
