Light-weight Black-box Failure Detection for Distributed Systems
Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-12-107. July 2012.
Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan
Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
Diagnosing failures in distributed systems is challenging, as modern datacenters run a wide variety of applications and systems. Current failure-detection techniques often require training, scale poorly, or produce results that are not intuitive to sysadmins. We present LFD, a lightweight and scalable technique for diagnosing performance problems in distributed systems using only correlations of operating system metrics that are collected transparently. The LFD fault detection algorithm is based on our hypothesis of server application behavior; it therefore requires no training, detects failures with complexity linear in the number of nodes, and yields results that sysadmins can interpret intuitively. With some training, LFD-DT additionally uses decision trees to diagnose the category of a previously seen problem. We show that LFD is versatile, diagnosing faults in both Hadoop MapReduce systems and multi-tier web request systems, and that its results are intuitive to sysadmins.
KEYWORDS: MapReduce, Web Applications, Diagnosis, Correlation, Decision-Trees
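As a rough illustration of training-free, correlation-based detection that scales linearly with the number of nodes, the sketch below computes one Pearson correlation per node between two operating system metric series and flags nodes whose correlation falls below a cutoff. This is not the LFD algorithm itself: the abstract does not name the metrics, the hypothesis, or any threshold, so the function names (`pearson`, `flag_nodes`), the choice of two metrics per node, and the `threshold` value of 0.5 are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series has no defined correlation; treat it as 0.0
        # (an assumption of this sketch).
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_nodes(samples, threshold=0.5):
    """Return the nodes whose two metric series are weakly correlated.

    `samples` maps a node name to a pair (metric_a, metric_b) of series
    collected over the same time window. Both the pairing of metrics and
    the threshold are hypothetical. Each node is visited once, so the
    cost is linear in the number of nodes for a fixed window length.
    """
    flagged = []
    for node, (metric_a, metric_b) in samples.items():
        if pearson(metric_a, metric_b) < threshold:
            flagged.append(node)
    return flagged


# Hypothetical window of samples for two nodes.
samples = {
    "node1": ([1, 2, 3, 4], [2, 4, 6, 8]),  # metrics move together
    "node2": ([1, 2, 3, 4], [4, 1, 3, 2]),  # metrics diverge
}
suspects = flag_nodes(samples)
```

Because each node is judged only on its own metric pair, no model needs to be trained and no node is compared against every other node, which is what keeps the work proportional to the node count in this sketch.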