PARALLEL DATA LAB 

PDL Abstract

Behavior-Based Problem Localization for Parallel File Systems

HotDep '10. October 3, 2010, Vancouver, BC, Canada.

Michael P. Kasick, Rajeev Gandhi, Priya Narasimhan

Parallel Data Laboratory
Carnegie Mellon University
Pittsburgh, PA 15213

http://www.pdl.cmu.edu/

We present a behavior-based problem-diagnosis approach for PVFS that analyzes a novel source of instrumentation (CPU instruction-pointer samples and function-call traces) to localize the faulty server and to enable root-cause analysis of the resource at fault. We validate our approach by injecting realistic storage and network problems into three different workloads (dd, IOzone, and PostMark) on a PVFS cluster.
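To make the idea of localizing a faulty server from sampled behavior concrete, here is a minimal sketch. It is not the authors' algorithm: the abstract does not describe the analysis itself, so the peer-comparison technique (scoring each server by how far its function-sample histogram lies from the per-function median across servers), the server names, and the function names `io_read` and `net_send` are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def normalize(samples):
    """Turn a list of sampled function names into a relative-frequency histogram."""
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {fn: n / total for fn, n in counts.items()}


def localize(samples_by_server):
    """Score each server by its L1 distance from the per-function median histogram.

    Assumption (not from the paper): fault-free servers behave alike, so the
    server with the largest distance is the most likely culprit.
    """
    hists = {s: normalize(v) for s, v in samples_by_server.items()}
    functions = sorted({fn for h in hists.values() for fn in h})
    med = {fn: median(h.get(fn, 0.0) for h in hists.values()) for fn in functions}
    scores = {
        s: sum(abs(h.get(fn, 0.0) - med[fn]) for fn in functions)
        for s, h in hists.items()
    }
    suspect = max(scores, key=scores.get)
    return suspect, scores


# Hypothetical instruction-pointer samples already resolved to function names.
example = {
    "server0": ["io_read"] * 50 + ["net_send"] * 50,
    "server1": ["io_read"] * 48 + ["net_send"] * 52,
    "server2": ["io_read"] * 90 + ["net_send"] * 10,  # spends far more time in I/O
}
suspect, scores = localize(example)
print(suspect)
```

In this toy input, server2 spends a much larger share of its samples in the hypothetical `io_read` function than its peers, so it receives the highest score. Which functions dominate the deviation would then hint at the resource involved (storage versus network), again under the assumptions stated above.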
