PARALLEL DATA LAB

PDL Abstract

Light-weight Black-box Failure Detection for Distributed Systems

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-12-107. July 2012.

Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan

Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213

http://www.pdl.cmu.edu/

Diagnosing failures in distributed systems is challenging because modern datacenters run a wide variety of applications and systems. Current techniques for detecting failures often require training, scale poorly, or produce results that are not intuitive to sysadmins. We present LFD, a lightweight and scalable technique for diagnosing performance problems in distributed systems using only correlations of operating system metrics, which are collected transparently. The LFD fault detection algorithm is based on our hypothesis of server application behavior; it therefore requires no training, detects failures with complexity linear in the number of nodes, and produces results that sysadmins can interpret intuitively. With some training, LFD-DT additionally uses decision trees to diagnose the category of a previously seen problem. We also show that LFD is versatile: it can diagnose faults in both Hadoop MapReduce systems and multi-tier web request systems, and we show how its results are intuitive to sysadmins.
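To make the idea of training-free, correlation-based detection concrete, the following is a minimal sketch and not the LFD algorithm itself, whose hypothesis and metrics are given in the full report rather than in this abstract. It assumes, purely for illustration, that each node reports two operating system metrics over a time window (here imagined as CPU utilization and network throughput), and that a node whose two metrics stop moving together is suspicious. The function names, the sample data, and the 0.5 threshold are all hypothetical. Each node is examined once, so the cost grows linearly with the number of nodes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sample sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant metric carries no correlation signal; treat as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_suspect_nodes(windows, threshold=0.5):
    """Return the sorted names of nodes whose metric pair correlates below threshold.

    windows maps a node name to a pair of sample sequences (metric_a, metric_b).
    The threshold is an illustrative assumption, not a value from the report.
    """
    suspects = []
    for node, (metric_a, metric_b) in windows.items():  # one pass: O(number of nodes)
        if pearson(metric_a, metric_b) < threshold:
            suspects.append(node)
    return sorted(suspects)


# Hypothetical per-node windows: (CPU utilization samples, network throughput samples).
windows = {
    "node1": ([10, 20, 30, 40, 50], [100, 210, 290, 405, 500]),
    "node2": ([10, 20, 30, 40, 50], [110, 190, 310, 395, 510]),
    "node3": ([90, 95, 92, 97, 94], [5, 3, 6, 2, 4]),  # busy CPU, little traffic
}

print(flag_suspect_nodes(windows))
```

In this made-up data, node1 and node2 show CPU and network activity rising together, while node3 is busy without doing proportional network work, so only node3 falls below the threshold. A correlation value of this kind is also easy to present to a sysadmin, since it can be read directly as "these two metrics no longer track each other on this node".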

KEYWORDS: MapReduce, Web Applications, Diagnosis, Correlation, Decision-Trees
