Problem Analysis

Automating problem analysis is crucial to achieving maintainable systems at the scales needed for tomorrow's high-end computing. Our research explores methodologies and algorithms for automating analysis of failures and performance degradations in large-scale systems, such as distributed storage. Problem analysis includes such crucial tasks as identifying which component(s) misbehaved and the likely root causes, diagnosing performance problems, and providing supporting evidence for any conclusions. Fingerpointing is one approach to problem diagnosis: it combines node-level (local) anomaly detection with a subsequent system-wide (global) detection step.
 
By combining statistical tools with appropriate instrumentation, we hope to significantly reduce the difficulty of analyzing performance and reliability problems in deployed large-scale systems. Such tools, integrated with automated reaction logic, also provide an essential building block for the longer-term goal of self-healing. Obtaining meaningful results will involve understanding which statistical tools work, and how well, to meet the challenge of problem detection/prediction. It will also involve quantifying the impact of instrumentation detail on the effectiveness of those tools, so as to help justify the associated instrumentation costs. Explorations will be done primarily in the framework of the Ursa Minor/Major cluster-based storage systems via fault injection and analysis of case studies observed in deployment.
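
The two-stage fingerpointing idea described above (a node-level check followed by a system-wide step) can be illustrated with a minimal Python sketch. This is not the project's algorithm: the function names, the 3-sigma threshold, the `max_fraction` cutoff, and the sample latencies are all hypothetical, chosen only to show one way the local and global stages might fit together.

```python
from statistics import mean, stdev


def local_anomaly(baseline, recent, k=3.0):
    """Node-level (local) check, an illustrative assumption:
    flag the node if the mean of its recent samples lies more than
    k baseline standard deviations from its own baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # needs at least two baseline samples
    if sigma == 0:
        return mean(recent) != mu
    return abs(mean(recent) - mu) > k * sigma


def fingerpoint(nodes, k=3.0, max_fraction=0.5):
    """System-wide (global) step, also an illustrative assumption:
    indict locally anomalous nodes only when they are a minority.
    If most nodes look anomalous at once, a workload shift is a more
    likely explanation than a faulty node, so nothing is indicted."""
    flagged = sorted(
        name
        for name, (baseline, recent) in nodes.items()
        if local_anomaly(baseline, recent, k)
    )
    if len(flagged) > max_fraction * len(nodes):
        return []
    return flagged


# Hypothetical per-node request latencies in ms: (baseline, recent).
nodes = {
    "node-a": ([10, 11, 9, 10, 10], [10, 11, 10]),
    "node-b": ([10, 9, 11, 10, 10], [10, 9, 10]),
    "node-c": ([10, 10, 11, 9, 10], [30, 32, 31]),
}
print(fingerpoint(nodes))  # prints ['node-c']
```

In this sketch the global step only filters the local verdicts. A real deployment would likely compare richer metrics across peers and report supporting evidence alongside each indicted node.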

People

FACULTY

Chuck Cranor
Christos Faloutsos
Greg Ganger
Rajeev Gandhi
Priya Narasimhan
Bianca Schroeder (postdoc)
Alice Zheng (postdoc)

STAFF

Gregg Economou
Michael Stroucken

STUDENTS

Mike Kasick
Soila Pertet Kavulya
Raja Sambasivan
Mike Abd-El-Malek
Eno Thereska
Evan Hoke
Jimeng Sun
John Strunk

Publications

  • Visualizing Request-flow Comparison to Aid Performance Diagnosis in Distributed Systems. Raja R. Sambasivan, Ilari Shafer, Michelle L. Mazurek, Gregory R. Ganger. IEEE Transactions on Visualization and Computer Graphics (Proceedings Information Visualization 2013), vol. 19, no. 12, Dec. 2013.
    Abstract / PDF [1.9M] / TRAILER VIDEO [5.6M] / VIDEO [17.9M]

  • Automated Diagnosis of Chronic Performance Problems in Production Systems. Soila P. Kavulya. Carnegie Mellon University Parallel Data Lab Ph.D. Dissertation. CMU-PDL-13-109, May 2013.
    Abstract / PDF [12.6M]

  • Diagnosing Performance Changes in Distributed Systems by Comparing Request Flows. Raja R. Sambasivan. Carnegie Mellon University Parallel Data Lab Ph.D. Dissertation. CMU-PDL-13-105, May 2013.
    Abstract / PDF [3.9M]

  • Visualizing Request-flow Comparison to Aid Performance Diagnosis in Distributed Systems. Raja R. Sambasivan, Ilari Shafer, Michelle L. Mazurek, Gregory R. Ganger. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-13-104 (supersedes CMU-PDL-12-102), April 2013.
    Abstract / PDF [1.93M]

  • Light-weight Black-box Failure Detection for Distributed Systems. Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-12-107. July 2012.
    Abstract / PDF [300K]

  • Visualizing Request-flow Comparison to Aid Performance Diagnosis in Distributed Systems. Raja R. Sambasivan, Ilari Shafer, Michelle L. Mazurek, Gregory R. Ganger. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-12-102. May 2012.
    Abstract / PDF [1.13M]

  • Automated Diagnosis without Predictability is a Recipe for Failure. Raja R. Sambasivan & Gregory R. Ganger. Proceedings of the 4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '12), June 12-13, 2012, Boston, MA. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-11-101.
    Abstract / PDF [368K]

  • Draco: Statistical Diagnosis of Chronic Problems in Large Distributed Systems. Soila P. Kavulya, Scott Daniels (AT&T), Kaustubh Joshi (AT&T), Matti Hiltunen (AT&T), Rajeev Gandhi, Priya Narasimhan. IEEE/IFIP Conference on Dependable Systems and Networks (DSN), June 2012.
    Abstract / PDF [859K]

  • End-to-end Tracing in HDFS. William Wang. Carnegie Mellon University School of Computer Science Technical Report (Masters Thesis) CMU-CS-11-120, July 2011.
    Abstract / PDF [489K]

  • Diagnosis in Automotive Systems: A Survey. Patrick E. Lanigan, Soila Kavulya, Priya Narasimhan, Thomas E. Fuhrman, Mutasim A. Salman. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-11-110. June 2011.
    Abstract / PDF [369K]

  • Automation Without Predictability is a Recipe for Failure. Raja R. Sambasivan, Gregory R. Ganger. Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-11-101, January 2011.
    Abstract / PDF [336K]

  • Draco: Top-Down Statistical Diagnosis of Large-scale VoIP Networks. Soila P. Kavulya, Kaustubh Joshi, Matti Hiltunen, Scott Daniels, Rajeev Gandhi, Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-11-109, April 2011.
    Abstract / PDF [787K]

  • Diagnosing Performance Changes by Comparing Request Flows. Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Gregory R. Ganger. 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI'11). March 30 - April 1, 2011. Boston, MA.
    Abstract / PDF [388K]

  • Behavior-Based Problem Localization for Parallel File Systems. Michael P. Kasick, Rajeev Gandhi, Priya Narasimhan. HotDep '10. October 3, 2010, Vancouver, BC, Canada.
    Abstract / PDF [149K]

  • To Upgrade or Not to Upgrade: Impact of Online Upgrades across Multiple Administrative Domains. T. Dumitras, E. Tilevich, P. Narasimhan. ACM Onward! Conference, Oct. 2010.
    Abstract / PDF [425K]

  • Diagnosing Performance Changes by Comparing System Behaviours. Raja R. Sambasivan, Alice X. Zheng, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Gregory R. Ganger. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-107. July 2010. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-103.
    Abstract / PDF [503K]

  • Diagnosing Performance Problems by Visualizing and Comparing System Behaviours. Raja R. Sambasivan, Alice X. Zheng, Elie Krevat, Spencer Whitman, Gregory R. Ganger. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-103, February 2010.
    Abstract / PDF [423K]

  • Why Do Upgrades Fail And What Can We Do About It? Toward Dependable, Online Upgrades in Enterprise Systems. T. Dumitras, P. Narasimhan. ACM/IFIP/USENIX Middleware Conference, Nov-Dec. 2009.
    Abstract / PDF [835K]

  • Toward Upgrades-as-a-Service in Distributed Systems. T. Dumitras, P. Narasimhan. Poster Session at Middleware 2009. 10th International Middleware Conference, Urbana Champaign, Illinois, USA.
    Abstract / PDF [602K]

  • ASDF: Automated, Online Fingerpointing for Hadoop. Keith Bare, Michael P. Kasick, Soila Kavulya, Eugene Marinelli, Xinghao Pan, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-104. May 2008.
    Abstract / PDF [650K]

  • Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems. Amar Phanishayee, Elie Krevat, Vijay Vasudevan, David G. Andersen, Gregory R. Ganger, Garth A. Gibson, Srinivasan Seshan. 6th USENIX Conference on File and Storage Technologies (FAST '08). Feb. 26-29, 2008. San Jose, CA. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-07-105, Sept. 2007.
    Abstract / PDF [374K]

  • Categorizing and Differencing System Behaviours. Raja R. Sambasivan, Alice X. Zheng, Eno Thereska, Gregory R. Ganger. Second Workshop on Hot Topics in Autonomic Computing. June 15, 2007. Jacksonville, FL.
    Abstract / PDF [120K]

  • Fingerpointing Correlated Failures in Replicated Systems. Soila Pertet, Rajeev Gandhi and Priya Narasimhan. USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), Cambridge, MA (April 2007).
    Abstract / PDF [100K]

  • Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? Bianca Schroeder, Garth A. Gibson. Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST '07), February 13–16, 2007, San Jose, CA. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-06-111, September 2006.
    Abstract / PDF [272K]

  • Observer: Keeping System Models from Becoming Obsolete. Eno Thereska, Dushyanth Narayanan, Anastassia Ailamaki, Gregory R. Ganger. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-07-101, January 2007.
    Abstract / Postscript [610K] / PDF [135K]

  • Towards Fingerpointing in the Emulab Dynamic Distributed System. Michael P. Kasick, Priya Narasimhan, Kevin Atkinson, Jay Lepreau. Proceedings of the 3rd USENIX Workshop on Real, Large Distributed Systems (WORLDS '06), Seattle, WA. Nov. 5, 2006.
    Abstract / PDF [311K]

  • InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, and Christos Faloutsos. ACM SIGOPS Operating Systems Review. Vol 40 Issue 3. July, 2006. ACM Press.
    Abstract / PDF [573K]

  • Group Communication: Helping or Obscuring Failure Diagnosis? Soila Pertet, Rajeev Gandhi and Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-06-107, June, 2006.
    Abstract / PDF [591K]

  • Stardust: Tracking Activity in a Distributed Storage System. Eno Thereska, Brandon Salmon, John Strunk, Matthew Wachs, Michael Abd-El-Malek, Julio Lopez, Gregory R. Ganger. Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'06). June 26th-30th 2006, Saint-Malo, France.
    Abstract / PDF [578K]

  • Causes of Failure in Web Applications. Soila Pertet and Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-05-109. December 2005.
    Abstract / PDF [367K]

  • A Large-scale Study of Failures in High-performance-computing Systems. Bianca Schroeder, Garth Gibson. Proceedings of the International Conference on Dependable Systems and Networks (DSN2006), Philadelphia, PA, USA, June 25-28, 2006. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-05-112, December, 2005.
    Abstract / PDF [570K]

Acknowledgements

Thank you to Google for support via a Google research grant.

This material is based on research sponsored in part by the National Science Foundation, via grants CCF-0621508 and CNS-0326453, by the Army Research Office, under agreement number DAAD19-02-1-0389, and by the Department of Energy under Award Number DE-FC02-06ER25767.

We thank the members and companies of the PDL Consortium: Amazon, Facebook, Google, Hewlett Packard Enterprise, Hitachi Ltd., Intel Corporation, IBM, Microsoft Research, NetApp, Inc., Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Seagate Technology, Two Sigma, and Western Digital for their interest, insights, feedback, and support.