Behavior-Based Problem Localization for Parallel File Systems
HotDep '10. October 3, 2010, Vancouver, BC, Canada.
Michael P. Kasick, Rajeev Gandhi, Priya Narasimhan
Parallel Data Laboratory
Carnegie Mellon University
Pittsburgh, PA 15213
We present a behavior-based problem-diagnosis approach for PVFS that analyzes a novel source of instrumentation, CPU instruction-pointer samples and function-call traces, to localize the faulty server and to enable root-cause analysis of the resource at fault. We validate our approach by injecting realistic storage and network problems into three different workloads (dd, IOzone, and PostMark) on a PVFS cluster.