Problem Diagnosis in Distributed Systems

Contact: Priya Narasimhan

Distributed systems contain multiple hardware and software components that can interact across multiple nodes/subsystems in sometimes unforeseen and complicated ways. As a result, determining the root cause of failures in these systems can be a very frustrating experience that might take several hours or even days.

Problem diagnosis (or fingerpointing) involves instrumenting systems to yield meaningful data, detecting errors and/or failures within these systems, and ascertaining their root cause, i.e., the underlying fault. Fingerpointing is difficult because the distributed interactions, protocols and inter-component dependencies in computer systems can cause a problem to change "shape" or manifestation as it propagates, leading to potential red herrings in problem determination. A single outward manifestation of a problem can have many possible root causes, and there might be insufficient information to distinguish between them. On the other hand, too much monitoring and too many error messages might overwhelm the system, obscure the root cause, and lead to increased latencies and additional resource costs.

We are currently developing a variety of techniques for automated fingerpointing in distributed systems. The aim is to perform online and offline root-cause analyses in order to identify a faulty node or process, diagnose the source of the problem, and report it to the user or administrator in a meaningful and useful manner.
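As a rough illustration of what identifying a faulty node could look like, the sketch below compares one metric across nodes that are expected to behave alike and flags any node that strays far from the group median. This is only one possible approach, assumed here for illustration, and is not a description of the project's actual techniques; the function name `suspect_nodes`, the `tolerance` parameter, and the sample node names and values are all hypothetical.

```python
from statistics import median


def suspect_nodes(metric_by_node, tolerance=3.0):
    """Return the nodes whose metric deviates strongly from their peers.

    Hypothetical helper: deviation is measured in units of the median
    absolute deviation (MAD), which is robust to a single bad node.
    """
    values = list(metric_by_node.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Peers are (nearly) identical, so any difference at all stands out.
        return sorted(n for n, v in metric_by_node.items() if v != med)
    return sorted(
        n for n, v in metric_by_node.items()
        if abs(v - med) / mad > tolerance
    )


# Made-up per-node request latencies in milliseconds.
latencies = {"node1": 100, "node2": 104, "node3": 98, "node4": 102, "node5": 250}
print(suspect_nodes(latencies))
```

A comparison like this only points at a node; it says nothing about why that node misbehaves, which is why diagnosis of the underlying fault remains a separate step.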

We ultimately aim for a preemptive strategy, one in which we need not wait for an instability or problem to escalate into a system-wide outage before taking remedial action, and which might therefore improve the system's overall responsiveness and availability. The idea is to observe the trends of various key metrics in the system to ascertain which of these are good indicators of the overall health of the system, and which metrics (if monitored appropriately and at the right frequency) could herald a potential outage. Thus, our techniques aim for two key elements: diagnosis of the root cause of the problem, and (where possible) a proactive indication of an imminent critical problem, given early enough that a total system failure can be averted.
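To make the idea of watching metric trends concrete, here is a minimal sketch that raises an early warning when a sample departs sharply from a sliding window of recent values. It assumes a simple z-score test purely for illustration and is not the project's method; the class `MetricTrendMonitor`, the helper `first_warning`, the window size, the threshold, and the sample data are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class MetricTrendMonitor:
    """Flags samples that deviate sharply from recently observed values."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.window = deque(maxlen=window)  # recent normal samples
        self.threshold = threshold          # allowed deviation, in std devs
        self.min_samples = min_samples      # warm-up before judging

    def observe(self, value):
        """Return True if `value` looks anomalous against the window."""
        anomalous = False
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = stdev(self.window)
            # A perfectly flat window (sigma == 0) is never flagged here.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        if not anomalous:
            # Keep anomalies out of the baseline so they cannot mask
            # the samples that follow them.
            self.window.append(value)
        return anomalous


def first_warning(samples, **kwargs):
    """Return the index of the first anomalous sample, or None."""
    monitor = MetricTrendMonitor(**kwargs)
    for i, value in enumerate(samples):
        if monitor.observe(value):
            return i
    return None


# Made-up queue-length samples: steady at first, then climbing fast.
queue_lengths = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 40, 80]
print(first_warning(queue_lengths))
```

The window size and sampling interval trade off sensitivity against overhead, which mirrors the point above about monitoring the right metrics at the right frequency.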



Priya Narasimhan
Rajeev Gandhi


Soila Pertet
Michael P. Kasick
Patrick E. Lanigan
Jiaqi Tan
Xinghao Pan
Keith Bare
Eugene Marinelli



We thank the members and companies of the PDL Consortium: Broadcom, Ltd., Citadel, Dell EMC, Facebook, Google, Hewlett-Packard Labs, Hitachi Ltd., Intel Corporation, Microsoft Research, MongoDB, NetApp, Inc., Oracle Corporation, Samsung Information Systems America, Seagate Technology, Tintri, Two Sigma, Uber, Veritas and Western Digital for their interest, insights, feedback, and support.




© 2016. Last updated 8 March, 2012