Contact: Greg Ganger, Priya Narasimhan

Automating problem analysis is crucial to achieving maintainable systems at the scales needed for tomorrow's high-end computing. Our research explores methodologies and algorithms for automating analysis of failures and performance degradations in large-scale systems, such as distributed storage. Problem analysis includes such crucial tasks as identifying which component(s) misbehaved and the likely root causes, diagnosing performance problems, and providing supporting evidence for any conclusions. Fingerpointing is one approach to problem diagnosis: it performs node-level (local) anomaly detection first, followed by system-wide (global) detection.
By combining statistical tools with appropriate instrumentation, we hope to significantly reduce the difficulty of analyzing performance and reliability problems in deployed large-scale systems. Such tools, integrated with automated reaction logic, also provide an essential building block for the longer-term goal of self-healing. Obtaining meaningful results will involve understanding which statistical tools work, and how well, for problem detection and prediction. It will also involve quantifying the impact of instrumentation detail on the effectiveness of those tools, so that the associated instrumentation costs can be justified. Explorations will be done primarily in the framework of the Ursa Minor/Major cluster-based storage systems via fault injection and analysis of case studies observed in deployment.
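
To make the local-then-global idea behind fingerpointing concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the project's actual algorithm: the function names, the z-score test for the local step, the median-absolute-deviation test for the global step, and both thresholds are all hypothetical choices made for this example.

```python
from statistics import mean, median, pstdev


def local_anomaly_score(history, latest):
    """Local step (assumed z-score test): distance of `latest` from the
    node's own history, measured in standard deviations."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0 if latest == mu else float("inf")
    return abs(latest - mu) / sigma


def fingerpoint(histories, latest, local_threshold=3.0, peer_factor=3.0):
    """Global step (assumed peer comparison): among locally anomalous nodes,
    blame only those whose latest value also differs from their peers.

    histories: dict mapping node name -> list of past metric values
    latest:    dict mapping node name -> most recent metric value
    Thresholds are arbitrary illustrative defaults.
    """
    # Local detection: each node is judged against its own past behavior.
    suspects = [
        node for node in histories
        if local_anomaly_score(histories[node], latest[node]) > local_threshold
    ]

    # Global detection: compare each suspect against the median of all nodes,
    # using the median absolute deviation (MAD) as a robust spread estimate.
    values = list(latest.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return sorted(
        node for node in suspects
        if abs(latest[node] - med) > peer_factor * mad
    )


if __name__ == "__main__":
    histories = {
        "a": [10, 11, 9, 10],
        "b": [10, 10, 11, 9],
        "c": [9, 10, 11, 10],
        "d": [10, 9, 11, 10],
    }
    # Only node "d" departs from both its own history and its peers.
    print(fingerpoint(histories, {"a": 10, "b": 11, "c": 10, "d": 50}))
    # Every node shifts together (e.g. a workload change): nobody is blamed.
    print(fingerpoint(histories, {"a": 50, "b": 50, "c": 50, "d": 50}))
```

The second call shows why the global step matters in this sketch: when every node changes behavior at once, each one looks anomalous locally, but none stands out from its peers, so no single node is singled out.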



Chuck Cranor
Christos Faloutsos
Greg Ganger
Rajeev Gandhi
Priya Narasimhan
Bianca Schroeder (postdoc)
Alice Zheng (postdoc)


Gregg Economou
Michael Stroucken


Mike Kasick
Soila Pertet Kavulya
Raja Sambasivan
Mike Abd-El-Malek
Eno Thereska
Evan Hoke
Jimeng Sun
John Strunk



Thank you to Google for support via a Google research grant.

We thank the members and companies of the PDL Consortium: Alibaba Group, Amazon, Datrium, Facebook, Google, Hewlett Packard Enterprise, Hitachi Ltd., Intel Corporation, IBM, Micron, Microsoft Research, NetApp, Inc., Oracle Corporation, Salesforce, Samsung Semiconductor Inc., Seagate Technology, and Two Sigma for their interest, insights, feedback, and support.

This material is based on research sponsored in part by the National Science Foundation, via grants CCF-0621508 and CNS-0326453, by the Army Research Office, under agreement number DAAD19-02-1-0389, and by the Department of Energy under Award Number DE-FC02-06ER25767.




© 2019. Legal Info.
Last updated 16 September, 2013