PDL Abstract

Making Problem Diagnosis Work for Large-Scale, Production Storage Systems

Proceedings of the 27th Large Installation System Administration Conference (LISA '13), Washington, DC, November 2013.

Michael P. Kasick, Priya Narasimhan, Kevin Harms*

Carnegie Mellon University
*Argonne National Laboratory

Intrepid has a very large production GPFS storage system consisting of 128 file servers, 32 storage controllers, 1,152 disk arrays, and 11,520 total disks. In such a large system, performance problems are both inevitable and difficult to troubleshoot. We present our experiences in taking an automated problem diagnosis approach from proof of concept on a 12-server test-bench parallel-filesystem cluster and making it work on Intrepid's storage system. We also present a 15-month case study of problems observed from the analysis of 624 GB of Intrepid's instrumentation data, in which we diagnose a variety of performance-related storage-system problems in a matter of hours, compared with the days or longer required by manual approaches.