PROBLEM ANALYSIS
Contact: Greg Ganger, Priya Narasimhan
Automating problem analysis is crucial to achieving maintainable systems at the scales needed for tomorrow's high-end computing. Our research explores methodologies and algorithms for automating analysis of failures and performance degradations in large-scale systems, such as distributed storage. Problem analysis includes such crucial tasks as identifying which component(s) misbehaved and the likely root causes, diagnosing performance problems, and providing supporting evidence for any conclusions. Fingerpointing is one approach to problem diagnosis: it performs node-level (local) anomaly detection, followed by system-wide (global) detection.
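The two-stage structure of fingerpointing can be illustrated with a minimal sketch. It is not the algorithm used in any of the publications listed below; the function names, the z-score and median-based statistics, and the threshold values are all assumptions chosen for illustration. Each node first scores its own recent metric samples against its own baseline (local step), and a global step then flags nodes whose local score stands out from their peers.

```python
from statistics import median


def local_score(samples, baseline_mean, baseline_std):
    """Node-level (local) step: mean absolute z-score of recent samples
    relative to the node's own baseline. Hypothetical scoring rule."""
    if baseline_std <= 0:
        raise ValueError("baseline_std must be positive")
    if not samples:
        raise ValueError("samples must be non-empty")
    return sum(abs(x - baseline_mean) / baseline_std for x in samples) / len(samples)


def fingerpoint(local_scores, threshold=3.0, min_scale=0.1):
    """System-wide (global) step: flag nodes whose local score is much
    higher than the peer median, measured in units of the median
    absolute deviation (floored at min_scale to avoid dividing by ~0)."""
    scores = list(local_scores.values())
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    scale = max(mad, min_scale)
    return sorted(
        node for node, s in local_scores.items() if (s - med) / scale > threshold
    )


# Example: three nodes behave alike, one deviates strongly.
scores = {"n1": 1.0, "n2": 1.1, "n3": 0.9, "n4": 6.0}
culprits = fingerpoint(scores)
```

Comparing each node against its peers, rather than against a fixed threshold, means a workload change that affects every node equally is not reported as a fault on any single node.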
By combining statistical tools with appropriate instrumentation, we hope to significantly reduce the difficulty of analyzing performance and reliability problems in deployed large-scale systems. Such tools, integrated with automated reaction logic, also provide an essential building block for the longer-term goal of self-healing. Obtaining meaningful results will involve understanding which statistical tools work, and how well, for problem detection and prediction. It will also involve quantifying the impact of instrumentation detail on the effectiveness of those tools, so that the associated instrumentation costs can be justified. Explorations will be done primarily in the framework of the Ursa Minor/Major cluster-based storage systems, via fault injection and analysis of case studies observed in deployment.
People
FACULTY
Chuck Cranor
Christos Faloutsos
Greg Ganger
Rajeev Gandhi
Priya Narasimhan
Bianca Schroeder (postdoc)
Alice Zheng (postdoc)
Gregg Economou
Michael Stroucken
Mike Kasick
Soila Pertet
Raja Sambasivan
Mike Abd-El-Malek
Eno Thereska
Evan Hoke
Jimeng Sun
John Strunk
Publications
- ASDF: Automated, Online Fingerpointing for Hadoop. Keith Bare, Michael P. Kasick, Soila Kavulya, Eugene Marinelli, Xinghao Pan, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-104. May 2008.
Abstract / PDF [650K]
- Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems. Amar Phanishayee, Elie Krevat, Vijay Vasudevan, David G. Andersen, Gregory R. Ganger, Garth A. Gibson, Srinivasan Seshan. 6th USENIX Conference on File and Storage Technologies (FAST '08). Feb. 26-29, 2008. San Jose, CA. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-07-105, Sept. 2007.
Abstract / PDF [374K]
- Categorizing and Differencing System Behaviours. Raja R. Sambasivan, Alice X. Zheng, Eno Thereska, Gregory R. Ganger. Second Workshop on Hot Topics in Autonomic Computing. June 15, 2007. Jacksonville, FL.
Abstract / PDF [120K]
- Fingerpointing Correlated Failures in Replicated Systems. Soila Pertet, Rajeev Gandhi and Priya Narasimhan. USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), Cambridge, MA (April 2007).
Abstract / PDF [100K]
- Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? Bianca Schroeder, Garth A. Gibson. Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST '07), February 13–16, 2007, San Jose, CA. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-06-111, September 2006.
Abstract / PDF [272K]
- Observer: Keeping System Models from Becoming Obsolete. Eno Thereska, Dushyanth Narayanan, Anastassia Ailamaki, Gregory R. Ganger. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-07-101, January 2007.
Abstract / Postscript [610K] / PDF [135K]
- Towards Fingerpointing in the Emulab Dynamic Distributed System. Michael P. Kasick, Priya Narasimhan, Kevin Atkinson, Jay Lepreau. Proceedings of the 3rd USENIX Workshop on Real, Large Distributed Systems (WORLDS '06), Seattle, WA. Nov. 5, 2006.
Abstract / PDF [311K]
- InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, and Christos Faloutsos. ACM SIGOPS Operating Systems Review. Vol 40 Issue 3. July, 2006. ACM Press.
Abstract / PDF [573K]
- Group Communication: Helping or Obscuring Failure Diagnosis? Soila Pertet, Rajeev Gandhi and Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-06-107, June, 2006.
Abstract / PDF [591K]
- Stardust: Tracking Activity in a Distributed Storage System. Eno Thereska, Brandon Salmon, John Strunk, Matthew Wachs, Michael Abd-El-Malek, Julio Lopez, Gregory R. Ganger. Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '06). June 26-30, 2006, Saint-Malo, France.
Abstract / PDF [578K]
- Causes of Failure in Web Applications. Soila Pertet and Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-05-109. December 2005.
Abstract / PDF [367K]
- A Large-scale Study of Failures in High-performance-computing Systems. Bianca Schroeder, Garth Gibson. Proceedings of the International Conference on Dependable Systems and Networks (DSN2006), Philadelphia, PA, USA, June 25-28, 2006. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-05-112, December 2005.
Abstract / PDF [570K]
Acknowledgements
Thank you to Google for support via a Google research grant.
We thank the members and companies of the PDL Consortium: American Power Conversion, Data Domain, Inc., EMC Corporation, Facebook, Google, Hewlett-Packard Labs, Hitachi, IBM, Intel Corporation, LSI, Microsoft Research, NetApp, Inc., Oracle Corporation, Seagate Technology, Sun Microsystems, Symantec Corporation and VMware, Inc. for their interest, insights, feedback, and support.
This material is based on research sponsored in part by the National Science Foundation, via grants CCF-0621508 and CNS-0326453, by the Army Research Office, under agreement number DAAD19-02-1-0389, and by the Department of Energy, under Award Number DE-FC02-06ER25767.