PARALLEL DATA LAB 

PDL Abstract

Automated Diagnosis of Chronic Performance Problems in Production Systems

Carnegie Mellon University Parallel Data Lab Ph.D. Dissertation. CMU-PDL-13-109, May 2013.

Soila P. Kavulya

Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213

http://www.pdl.cmu.edu/

Large production systems are susceptible to chronic performance problems, in which the system still works but with degraded performance. These problems (chronics) occur intermittently or affect only a subset of end-users. Traditional diagnosis typically takes a bottom-up approach that localizes problems by correlating low-level alarms (such as resource utilization indicators or network packet loss) across components in a production system. However, these alarm-correlation approaches fall short when diagnosing chronics because they do not provide the application-level visibility needed to detect chronics effectively. Due to the scale and complexity of production systems, there can be multiple unresolved chronics at any given time; their symptoms often overlap, and they are sometimes triggered by complex corner cases.

This dissertation presents a top-down framework for diagnosing chronic performance problems in production systems. The framework comprises four components. First, an extensible log-analysis framework extracts end-to-end causal flows from common white-box (i.e., application) logs in the production system; these end-to-end flows capture the user's experience with the system. Second, anomaly-detection tools use heuristics and a peer-comparison approach to label each end-to-end flow as successful or failed. Third, a top-down statistical diagnostic tool combines white-box metrics with black-box metrics (e.g., CPU usage) to localize the source of the problem by identifying attributes that are more correlated with failed flows than with successful ones. Fourth, a visualization tool uses peer-comparison to highlight anomalous nodes in a parallel-computing cluster. The diagnostic framework has been used to localize real incidents at an academic cloud-computing cluster that runs the Hadoop parallel-processing framework, and at a production Voice-over-IP system at a major Internet Services Provider. Our approach is not limited to these two systems and is applicable to systems such as Internet Services that serve users via independent interactions.
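
The second and third components can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not specify the statistical method, so a simple median-based peer comparison and a difference of proportions stand in for it here. The function names, the 3x-median threshold, and the sample flows are all hypothetical.

```python
from collections import Counter
from statistics import median


def label_flows_by_peer_comparison(durations, threshold=3.0):
    """Label a flow as failed if it is much slower than its peers.

    Assumption: "much slower" means duration > threshold * median duration.
    Returns a dict mapping flow id -> True (failed) or False (successful).
    """
    med = median(durations.values())
    return {fid: d > threshold * med for fid, d in durations.items()}


def rank_attributes(flows, failed):
    """Rank attributes by how much more often they occur in failed flows.

    flows:  dict mapping flow id -> set of attribute strings
    failed: dict mapping flow id -> bool
    Score (an assumed stand-in metric): P(attr | failed) - P(attr | successful).
    """
    n_fail = sum(1 for fid in flows if failed[fid])
    n_ok = len(flows) - n_fail
    fail_counts, ok_counts = Counter(), Counter()
    for fid, attrs in flows.items():
        (fail_counts if failed[fid] else ok_counts).update(attrs)

    scores = {}
    for attr in set(fail_counts) | set(ok_counts):
        p_fail = fail_counts[attr] / n_fail if n_fail else 0.0
        p_ok = ok_counts[attr] / n_ok if n_ok else 0.0
        scores[attr] = p_fail - p_ok
    # Highest score first: attributes most associated with failed flows.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical end-to-end flows: durations in seconds, plus a mix of
# white-box (node) and black-box (CPU) attributes for each flow.
durations = {"f1": 1.0, "f2": 1.2, "f3": 0.9, "f4": 9.0, "f5": 8.5, "f6": 1.1}
flows = {
    "f1": {"node=A", "cpu=low"},
    "f2": {"node=B", "cpu=low"},
    "f3": {"node=A", "cpu=low"},
    "f4": {"node=C", "cpu=high"},
    "f5": {"node=C", "cpu=low"},
    "f6": {"node=B", "cpu=low"},
}

failed = label_flows_by_peer_comparison(durations)
ranked = rank_attributes(flows, failed)
print(ranked[:2])
```

In this toy data the two slow flows both ran on node C, so "node=C" ranks first. A production tool would also need to handle attributes that are rare overall and to separate overlapping chronics, which this sketch does not attempt.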
