Parallel Data Laboratory

Fingerpointing: Problem Diagnosis in Distributed Systems

Distributed systems contain multiple hardware and software components that can interact across multiple nodes/subsystems in sometimes unforeseen and complicated ways. As a result, determining the root cause of failures in these systems can be a very frustrating experience that might take several hours or even days.

Problem diagnosis (or fingerpointing) involves instrumenting systems to yield meaningful data, detecting errors and/or failures within these systems, and ascertaining their root cause, i.e., the underlying fault. Fingerpointing is difficult because the distributed interactions, protocols, and inter-component dependencies in computer systems can cause a problem to change "shape" or manifestation, leading to potential red herrings in problem determination. A single outward manifestation of a problem can have many possible root causes, and there might be insufficient information to distinguish among them. On the other hand, too much monitoring and too many error messages might overwhelm the system, obscure the root cause, and lead to increased latencies and additional resource costs.

We are currently developing a variety of techniques for automated fingerpointing in distributed systems -- the aim is to perform online and offline root-cause analyses in order to identify a faulty node/process, diagnose the source of the problem, and report it to the user or administrator in a meaningful/useful manner.

We ultimately aim for a preemptive strategy (one that need not wait for an instability or problem to escalate into a system-wide outage before taking remedial action) that might improve the system's overall responsiveness and availability. The idea is to observe the trends of various key metrics in the system to ascertain which of these are good indicators of the overall health of the system, and which metrics (if monitored appropriately and at the right frequency) could herald a potential outage. Thus, our techniques aim for two key elements: diagnosis of the root cause of the problem, and (where possible) a proactive indication of an imminent critical problem, given early enough to avert a total system failure.

People

FACULTY

Priya Narasimhan
Rajeev Gandhi

GRAD STUDENTS

Soila Pertet
Michael P. Kasick
Patrick E. Lanigan
Jiaqi Tan
Xinghao Pan
Keith Bare
Eugene Marinelli

Publications

Visual, Log-based Causal Tracing for Performance Debugging of MapReduce Systems. Jiaqi Tan*, Soila Kavulya, Rajeev Gandhi and Priya Narasimhan. 30th IEEE International Conference on Distributed Computing Systems (ICDCS) 2010, Genoa, Italy, Jun 2010.
Abstract / PDF [2.1M]
An Analysis of Traces from a Production MapReduce Cluster. Soila Kavulya, Jiaqi Tan, Rajeev Gandhi and Priya Narasimhan. 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010). May 17-20, 2010, Melbourne, Victoria, Australia. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-09-107, December, 2009.
Abstract / PDF [832K]
Kahuna: Problem Diagnosis for MapReduce-Based Cloud Computing Environments. Jiaqi Tan, Xinghao Pan, Eugene Marinelli, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan. Proceedings of the 12th IEEE/IFIP Network Operations and Management Symposium (NOMS) 2010, Osaka, Japan, Apr 2010.
Abstract / PDF [2.8M]
Black-Box Problem Diagnosis in Parallel File Systems. Michael P. Kasick, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST '10), San Jose, CA, February 2010.
Abstract / PDF [533K]
Log-Based Approaches to Characterizing and Diagnosing MapReduce Systems. Jiaqi Tan. School of Computer Science Master's Thesis CMU-CS-09-143, Carnegie Mellon University, July 2009.
Abstract / PDF
Blind Men and the Elephant: Piecing Together Hadoop for Diagnosis. Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan. 20th IEEE International Symposium on Software Reliability Engineering (ISSRE), Industrial Track, Mysuru, India, Nov 2009.
Abstract / PDF [160K]
Ganesha: Black-Box Fault Diagnosis for MapReduce Systems. Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan. Workshop on Hot Topics in Measurement and Modeling of Computer Systems (HotMetrics 2009), Seattle, WA (June 2009). Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-112. September 2008.
Abstract / PDF [180K]
Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop. Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan. Workshop on Hot Topics in Cloud Computing (HotCloud '09), San Diego, CA, June 15, 2009. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-09-103, May 2009.
Abstract / PDF [373K]
System-Call Based Problem Diagnosis for PVFS. Michael P. Kasick, Keith A. Bare, Eugene E. Marinelli III, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. Proceedings of the 5th Workshop on Hot Topics in System Dependability (HotDep '09). Lisbon, Portugal. June 2009.
Abstract / PDF [117K]
Diagnosing Performance Problems in Parallel File Systems. Michael P. Kasick. Electrical & Computer Engineering Department Master's Thesis, Carnegie Mellon University, May 2009.
PDF
The Blind Men and the Elephant: Piecing Together Hadoop for Diagnosis. Xinghao Pan. School of Computer Science Master's Thesis CMU-CS-09-135, Carnegie Mellon University, May 2009.
Abstract / PDF
SALSA: Analyzing Logs as StAte Machines. Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi and Priya Narasimhan. USENIX Workshop on Analysis of System Logs (WASL), San Diego, CA (December 2008). Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-111. September 2008.
Abstract / PDF [630K]
RAMS and BlackSheep: Inferring White-box Application Behavior Using Black-box Techniques. Jiaqi Tan, Priya Narasimhan. School of Computer Science Senior Honors Thesis and Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-103, May, 2008.
Abstract / PDF [1.7M]
ASDF: Automated, Online Fingerpointing for Hadoop. Keith Bare, Michael P. Kasick, Soila Kavulya, Eugene Marinelli, Xinghao Pan, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-104. May 2008.
Abstract / PDF [650K]
Fingerpointing Correlated Failures in Replicated Systems. Soila Pertet, Rajeev Gandhi and Priya Narasimhan. USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), Cambridge, MA (April 2007).
Abstract / PDF [100K]
Towards Fingerpointing in the Emulab Dynamic Distributed System. Michael P. Kasick, Priya Narasimhan, Kevin Atkinson, Jay Lepreau. Proceedings of the 3rd USENIX Workshop on Real, Large Distributed Systems (WORLDS '06), Seattle, WA. Nov. 5, 2006.
Abstract / PDF [311K]
Group Communication: Helping or Obscuring Failure Diagnosis? Soila Pertet, Rajeev Gandhi and Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-06-107, June, 2006.
Abstract / PDF [591K]
Causes of Failure in Web Applications. Soila Pertet and Priya Narasimhan. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-05-109. December 2005.
Abstract / PDF [367K]

Acknowledgements

We thank the members and companies of the PDL Consortium: Bloomberg LP, Datadog, Google, Intel Corporation, Jane Street, LayerZero Research, Meta, Microsoft Research, Oracle Corporation, Oracle Cloud Infrastructure, Pure Storage, Salesforce, Samsung Semiconductor Inc., and Western Digital for their interest, insights, feedback, and support.