PARALLEL DATA LAB 

PDL Abstract

Black-Box Problem Diagnosis in Parallel File Systems

Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST '10),
San Jose, CA, February 2010.

Michael P. Kasick, Jiaqi Tan*, Rajeev Gandhi, Priya Narasimhan

Parallel Data Laboratory
School of Computer Science & Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213

*DSO National Laboratories Singapore

http://www.pdl.cmu.edu/

We focus on automatically diagnosing performance problems in parallel file systems by identifying, gathering, and analyzing OS-level, black-box performance metrics on every node in the cluster. Our peer-comparison diagnosis approach compares the statistical attributes of these metrics across I/O servers to identify the faulty node. We develop a root-cause analysis procedure that further analyzes the affected metrics to pinpoint the faulty resource (storage or network), and show that this approach applies generally across stripe-based parallel file systems. We demonstrate our approach on realistic storage and network problems injected into three different file-system benchmarks (dd, IOzone, and PostMark), in both PVFS and Lustre clusters.
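
The peer-comparison idea described above can be illustrated with a minimal sketch. It is not the procedure from the paper: the abstract does not say which statistical attributes are compared, so the sketch assumes a simple median-based outlier test, and the server names, metric values, threshold, and the function name flag_faulty_servers are all hypothetical. The underlying intuition is that in a stripe-based file system the I/O servers see similar load, so a server whose metric departs sharply from its peers is a candidate for the faulty node.

```python
from statistics import median


def flag_faulty_servers(samples, threshold=3.0, min_rel_scale=0.05):
    """Flag servers whose metric deviates strongly from their peers.

    samples: dict mapping a server name to a list of readings of one
             OS-level metric (for example, a disk wait time).
    threshold, min_rel_scale: illustrative tuning knobs, not values
             taken from the paper.
    """
    # Summarize each server by the median of its readings.
    per_server = {name: median(values) for name, values in samples.items()}

    flagged = []
    for name, value in per_server.items():
        # Compare this server only against the other servers.
        peers = [v for other, v in per_server.items() if other != name]
        peer_median = median(peers)

        # Median absolute deviation of the peers, with a floor so that
        # near-identical peers do not make every small difference look large.
        mad = median([abs(v - peer_median) for v in peers])
        scale = max(mad, min_rel_scale * abs(peer_median), 1e-9)

        if abs(value - peer_median) / scale > threshold:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Hypothetical readings from four I/O servers; ios3 is the odd one out.
    readings = {
        "ios0": [10, 11, 10, 12],
        "ios1": [11, 10, 12, 11],
        "ios2": [10, 12, 11, 10],
        "ios3": [45, 50, 48, 52],
    }
    print(flag_faulty_servers(readings))
```

A second step in the same spirit would look at which metrics triggered the flag (storage-related or network-related) to suggest the faulty resource. That mapping is left out here because the abstract does not specify it.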
