HotDep '10. October 3, 2010, Vancouver, BC, Canada.
Michael P. Kasick, Rajeev Gandhi, Priya Narasimhan
Parallel Data Laboratory
Carnegie Mellon University
Pittsburgh, PA 15213
                    
We present a behavior-based problem-diagnosis approach for PVFS that analyzes a novel source of instrumentation (CPU instruction-pointer samples and function-call traces) to localize the faulty server and to enable root-cause analysis of the resource at fault. We validate our approach by injecting realistic storage and network problems into three different workloads (dd, IOzone, and PostMark) on a PVFS cluster.
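
The abstract does not say how the instruction-pointer samples are compared across servers. As an illustration only, the sketch below assumes a simple peer comparison: each server's samples are reduced to a normalized histogram over function names, and the server whose histogram lies farthest from its peers is flagged as the suspect. The function names, server names, sample counts, and the L1-distance and median scoring are all hypothetical choices made for this example, not the paper's method.

```python
from collections import Counter
from statistics import median


def histogram(samples):
    """Turn a list of sampled function names into a frequency distribution."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {fn: n / total for fn, n in counts.items()}


def l1_distance(p, q):
    """Sum of absolute differences between two frequency distributions."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def localize(samples_by_server):
    """Score each server by its median distance to all peers (needs >= 2 servers).

    Returns the highest-scoring server and the full score table.
    """
    hists = {s: histogram(v) for s, v in samples_by_server.items()}
    scores = {
        s: median(l1_distance(h, other) for t, other in hists.items() if t != s)
        for s, h in hists.items()
    }
    suspect = max(scores, key=scores.get)
    return suspect, scores


# Hypothetical data: three servers behave alike, s4 spends far more
# samples in a made-up "disk_io" function.
example = {
    "s1": ["request"] * 50 + ["disk_io"] * 30 + ["net_send"] * 20,
    "s2": ["request"] * 48 + ["disk_io"] * 32 + ["net_send"] * 20,
    "s3": ["request"] * 52 + ["disk_io"] * 28 + ["net_send"] * 20,
    "s4": ["request"] * 20 + ["disk_io"] * 70 + ["net_send"] * 10,
}
suspect, scores = localize(example)
```

In this toy setup the function that dominates the suspect's histogram also hints at the resource involved, which is the kind of root-cause clue the abstract refers to. A real deployment would need thresholds and time windows that this sketch leaves out.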
                    