Fingerpointing Correlated Failures in Replicated Systems
USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), Cambridge, MA (April 2007).
Soila Pertet, Rajeev Gandhi and Priya Narasimhan
Parallel Data Laboratory
Carnegie Mellon University
Pittsburgh, PA 15213
Replicated systems are often hosted over underlying group communication protocols that provide totally ordered, reliable delivery of messages. In the face of a performance problem at a single node, these protocols can cause correlated performance degradations even at non-faulty nodes, leading to potential red herrings in failure diagnosis. We propose a fingerpointing approach that performs node-level (local) anomaly detection, followed by system-wide (global) fingerpointing. The local anomaly detection relies on threshold-based analyses of system metrics, while global fingerpointing is based on the hypothesis that the root cause of the failure is the node with an “odd-man-out” view of the anomalies. We compare the results of applying three classifiers (a heuristic algorithm, an unsupervised learner (k-means clustering), and a supervised learner (k-nearest-neighbor)) to fingerpoint the faulty node.
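The two-stage idea described above can be illustrated with a minimal sketch. This is not the paper's algorithm: the metric names, thresholds, sample values, and the Hamming-distance rule used to pick the "odd-man-out" are all assumptions made for illustration. Each node first flags which metrics exceed a threshold (local detection), and the node whose anomaly vector differs most from the others is then singled out (global fingerpointing).

```python
from typing import Dict, List

# Hypothetical metrics and thresholds, chosen only for illustration.
THRESHOLDS: Dict[str, float] = {"cpu": 0.9, "ctx_switches": 1000.0, "latency_ms": 50.0}


def local_anomalies(sample: Dict[str, float], thresholds: Dict[str, float]) -> List[int]:
    """Local step: flag a metric with 1 if it exceeds its threshold, else 0.

    Metrics are visited in sorted name order so that vectors are comparable.
    """
    return [1 if sample[m] > thresholds[m] else 0 for m in sorted(thresholds)]


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions at which two anomaly vectors disagree."""
    return sum(x != y for x, y in zip(a, b))


def odd_man_out(views: Dict[str, List[int]]) -> str:
    """Global step (assumed heuristic): pick the node whose anomaly vector
    has the largest total Hamming distance to all other nodes' vectors.
    On a tie, the first such node in dictionary order is returned.
    """
    def total_distance(node: str) -> int:
        return sum(hamming(views[node], views[o]) for o in views if o != node)

    return max(views, key=total_distance)


# Made-up samples: every node sees high latency (the correlated symptom),
# but only n1 also shows high CPU usage.
samples = {
    "n1": {"cpu": 0.95, "ctx_switches": 400.0, "latency_ms": 80.0},
    "n2": {"cpu": 0.40, "ctx_switches": 450.0, "latency_ms": 75.0},
    "n3": {"cpu": 0.35, "ctx_switches": 420.0, "latency_ms": 78.0},
}
views = {node: local_anomalies(s, THRESHOLDS) for node, s in samples.items()}
suspect = odd_man_out(views)
```

In this made-up scenario the latency anomaly alone would implicate every node, which is the kind of red herring the abstract warns about; comparing whole anomaly vectors is what separates n1 from the rest. A clustering or nearest-neighbor classifier could replace the distance heuristic in `odd_man_out`, but the feature encoding shown here remains an assumption.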