FINGERPOINTING:
Problem Diagnosis in Distributed Systems
Contact: Priya Narasimhan
Distributed systems contain multiple hardware and software components that can interact across multiple nodes/subsystems in sometimes unforeseen and complicated ways. As a result, determining the root cause of failures in these systems can be a very frustrating experience that might take several hours or even days.
Problem diagnosis (or fingerpointing) involves instrumenting systems to yield meaningful data, detecting errors and/or failures within these systems, and ascertaining their root cause, i.e., the underlying fault. Fingerpointing is difficult because the distributed interactions, protocols and inter-component dependencies in computer systems can cause a problem to change "shape" or manifestation, leading to potential red herrings in problem determination. A single outward manifestation of a problem can have many possible root causes, and there might be insufficient information to distinguish between them. On the other hand, too much monitoring and too many error messages might overwhelm the system, obscure the root cause, and lead to increased latencies and additional resource costs.
We are currently developing a variety of techniques for automated fingerpointing in distributed systems. The aim is to perform online and offline root-cause analyses in order to identify a faulty node/process, diagnose the source of the problem, and report it to the user or administrator in a meaningful and useful manner.
We ultimately aim for a preemptive strategy (where we need not wait for an instability or problem to manifest as a system-wide outage before taking remedial action) that might improve the system's overall responsiveness and availability. The idea is to observe the trends of various key metrics in the system to ascertain which of these can be good indicators of the overall health of the system, and which metrics (if monitored appropriately and at the right frequency) could herald a potential outage. Thus, our techniques aim for two key elements: diagnosis of the root cause of the problem, and (where possible) a proactive indication of an imminent critical problem in the system, early enough to avert a total system failure.
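As a rough illustration of watching a metric's trend for early warning signs, the sketch below flags a sample that departs sharply from a rolling baseline of recent samples. It is only a minimal example under stated assumptions, not the technique used in any of the publications listed on this page: the `TrendMonitor` class, the `first_warning` helper, the window size, and the threshold are all hypothetical choices made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class TrendMonitor:
    """Flags samples that deviate sharply from a rolling baseline.

    Hypothetical helper for illustration only; the window size and
    threshold are assumed values, not tuned recommendations.
    """

    def __init__(self, window=5, threshold=3.0):
        self.window = deque(maxlen=window)  # most recent samples
        self.threshold = threshold          # allowed deviation, in std devs

    def observe(self, value):
        """Return True if `value` is anomalous relative to the window."""
        anomalous = False
        # Only judge a sample once a full window of history exists.
        if len(self.window) == self.window.maxlen:
            baseline = mean(self.window)
            spread = pstdev(self.window)
            if spread == 0:
                # Flat baseline: any change at all counts as a deviation.
                anomalous = value != baseline
            else:
                anomalous = abs(value - baseline) / spread > self.threshold
        self.window.append(value)
        return anomalous


def first_warning(samples, window=5, threshold=3.0):
    """Return the index of the first anomalous sample, or None."""
    monitor = TrendMonitor(window, threshold)
    for index, value in enumerate(samples):
        if monitor.observe(value):
            return index
    return None


if __name__ == "__main__":
    # Made-up request latencies (ms) with a sudden jump at the end.
    latencies = [10, 11, 10, 12, 11, 10, 11, 50]
    print(first_warning(latencies))
```

In practice, the choice of metric, sampling frequency, and threshold would determine whether such a signal arrives early enough to be useful, which is the question the paragraph above raises.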
People
FACULTY
Priya Narasimhan
Rajeev Gandhi
GRAD STUDENTS
Soila Pertet
Michael P. Kasick
Patrick E. Lanigan
Jiaqi Tan
Xinghao Pan
Keith Bare
Eugene Marinelli
Publications
- Log-Based Approaches to Characterizing and Diagnosing MapReduce Systems. Jiaqi Tan. School of Computer Science Master's Thesis CMU-CS-09-143, Carnegie Mellon University, July 2009.
Abstract / PDF
- Ganesha: Black-Box Fault Diagnosis for MapReduce Systems. Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan. Workshop on Hot Topics in Measurement and Modeling of Computer Systems (HotMetrics 2009), Seattle, WA (June 2009). Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-112. September 2008.
Abstract / PDF [180K]
- Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop. Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan. Workshop on Hot Topics in Cloud Computing (HotCloud '09), San Diego, CA, on June 15, 2009. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-09-103, May 2009.
Abstract / PDF [373K]
- System-Call Based Problem Diagnosis for PVFS. Michael P. Kasick, Keith A. Bare, Eugene E. Marinelli III, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. Proceedings of the 5th Workshop on Hot Topics in System Dependability (HotDep '09). Lisbon, Portugal. June 2009.
Abstract / PDF [117K]
- Diagnosing Performance Problems in Parallel File Systems. Michael P. Kasick. Electrical & Computer Engineering Department Master's Thesis, Carnegie Mellon University, May 2009.
PDF
- The Blind Men and the Elephant: Piecing Together Hadoop for Diagnosis. Xinghao Pan. School of Computer Science Master's Thesis CMU-CS-09-135, Carnegie Mellon University, May 2009.
Abstract / PDF
- SALSA: Analyzing Logs as StAte Machines. Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi and Priya Narasimhan. USENIX Workshop on Analysis of System Logs (WASL), San Diego, CA (December 2008). Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-111. September 2008.
Abstract / PDF [630K]
- RAMS and BlackSheep: Inferring White-box Application Behavior Using Black-box Techniques. Jiaqi Tan, Priya Narasimhan. School of Computer Science Senior Honors Thesis and Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-103, May 2008.
Abstract / PDF [1.7M]
- ASDF: Automated, Online Fingerpointing for Hadoop. Keith Bare, Michael P. Kasick, Soila Kavulya, Eugene Marinelli, Xinghao Pan, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-104. May 2008.
Abstract / PDF [650K]
- Fingerpointing Correlated Failures in Replicated Systems. Soila Pertet, Rajeev Gandhi and Priya Narasimhan. USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), Cambridge, MA (April 2007).
Abstract / PDF [100K]
- Towards Fingerpointing in the Emulab Dynamic Distributed System. Michael P. Kasick, Priya Narasimhan, Kevin Atkinson, Jay Lepreau. Proceedings of the 3rd USENIX Workshop on Real, Large Distributed Systems (WORLDS '06), Seattle, WA. Nov. 5, 2006.
Abstract / PDF [311K]
- Group Communication: Helping or Obscuring Failure Diagnosis? Soila Pertet, Rajeev Gandhi and Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-06-107, June 2006.
Abstract / PDF [591K]
- Causes of Failure in Web Applications. Soila Pertet and Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-05-109. December 2005.
Abstract / PDF [367K]
Acknowledgements
We thank the members and companies of the PDL Consortium: American Power Conversion, Data Domain, Inc., EMC Corporation, Facebook, Google, Hewlett-Packard Labs, Hitachi, IBM, Intel Corporation, LSI, Microsoft Research, NetApp, Inc., Oracle Corporation, Seagate Technology, Sun Microsystems, Symantec Corporation and VMware, Inc. for their interest, insights, feedback, and support.