ABSTRACT

    Proceedings of the 3rd USENIX Workshop on Real, Large Distributed Systems (WORLDS '06), Seattle, WA.
    November 5, 2006.

    Towards Fingerpointing in the Emulab Dynamic Distributed System

    Michael P. Kasick, Priya Narasimhan, Kevin Atkinson, Jay Lepreau

    Parallel Data Laboratory
    Carnegie Mellon University
    Pittsburgh, PA 15213

    http://www.pdl.cmu.edu/

    In the large-scale Emulab distributed system, the many failure reports make skilled operator time a scarce and costly resource, as shown by statistics on failure frequency and root cause.  We describe the lessons learned from error reporting in Emulab, along with the design, initial implementation, and results of a new local error-analysis approach that is running in production.  Through structured error reporting, association of context with each error type, and propagation of both error type and context, our new local analysis locates the most prominent failure at the procedure, script, or session level.  Evaluation of this local analysis on a targeted set of common Emulab failures suggests that the approach is generally accurate and will facilitate global fingerpointing, which will aim to reliably suggest the root cause of a failure at the system level.


