ABSTRACT

    USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), Cambridge, MA (April 2007).

    Fingerpointing Correlated Failures in Replicated Systems

    Soila Pertet, Rajeev Gandhi and Priya Narasimhan

    Parallel Data Laboratory
    Carnegie Mellon University
    Pittsburgh, PA 15213

    http://www.pdl.cmu.edu/

    Replicated systems are often hosted over underlying group communication protocols that provide totally ordered, reliable delivery of messages. In the face of a performance problem at a single node, these protocols can cause correlated performance degradations even at non-faulty nodes, leading to potential red herrings in failure diagnosis. We propose a fingerpointing approach that combines node-level (local) anomaly detection with a subsequent system-wide (global) fingerpointing step. The local anomaly detection relies on threshold-based analyses of system metrics, while the global fingerpointing is based on the hypothesis that the root cause of the failure is the node with an “odd-man-out” view of the anomalies. We compare the results of applying three classifiers – a heuristic algorithm, an unsupervised learner (k-means clustering), and a supervised learner (k-nearest-neighbor) – to fingerpoint the faulty node.
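
    The two-stage idea described above can be illustrated with a minimal sketch. This is not the paper's algorithm: the metric names, the 3-sigma threshold, the sample data, and the Hamming-distance scoring are all assumptions chosen for illustration. Each node first flags metrics that deviate from its own baseline, and the node whose anomaly vector disagrees most with the others is then reported as the "odd man out".

```python
from statistics import mean, stdev

def local_anomalies(baseline, current, k=3.0):
    """Local step (assumed rule): flag a metric as anomalous when its current
    value lies more than k standard deviations from its baseline mean."""
    flags = []
    for name in sorted(baseline):  # fixed metric order so vectors are comparable
        mu = mean(baseline[name])
        sigma = stdev(baseline[name])
        flags.append(1 if abs(current[name] - mu) > k * sigma else 0)
    return flags

def odd_man_out(views):
    """Global step (assumed heuristic): score each node by the summed Hamming
    distance between its anomaly vector and every other node's vector, and
    return the node with the highest score together with all scores."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    scores = {
        node: sum(hamming(vec, other) for peer, other in views.items() if peer != node)
        for node, vec in views.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical baseline samples shared by all three replicas.
baseline = {"cpu": [10, 11, 9, 10, 10], "latency": [5, 5, 6, 4, 5]}

# Hypothetical current readings: node_a has a local CPU problem, and the
# ordered-delivery protocol makes latency degrade on every node.
current = {
    "node_a": {"cpu": 90, "latency": 40},
    "node_b": {"cpu": 10, "latency": 40},
    "node_c": {"cpu": 10, "latency": 40},
}

views = {node: local_anomalies(baseline, obs) for node, obs in current.items()}
culprit, scores = odd_man_out(views)
print(views)    # per-node anomaly vectors in (cpu, latency) order
print(culprit)  # node whose view disagrees most with the others
```

    In this toy data every node sees the correlated latency anomaly, so latency alone would be a red herring; only node_a also flags CPU, which makes its view the outlier. A clustering or nearest-neighbor classifier, as compared in the paper, would replace the distance-sum heuristic in the global step.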



    © 2008.
    Last updated 3 April, 2007