PDL Abstract

Automated Diagnosis of Chronic Performance

Carnegie Mellon University Parallel Data Lab Ph.D. Dissertation. CMU-PDL-13-109, May 2013.

Soila P. Kavulya

Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213

Large production systems are susceptible to chronic performance problems, where the system still works but with degraded performance. Chronic performance problems occur intermittently or affect only a subset of end-users. Traditional approaches to diagnosis typically work bottom-up, localizing problems by correlating low-level alarms (such as resource utilization indicators or network packet loss) across components in a production system. However, these alarm-correlation approaches fall short when diagnosing chronics because they fail to provide the application-level visibility needed to detect chronics effectively. Due to the scale and complexity of production systems, there can be multiple unresolved chronics at any given time; their symptoms often overlap with each other, and they are sometimes triggered by complex corner cases.

This dissertation presents a top-down framework for diagnosing chronic performance problems in production systems. The framework comprises four components. First, an extensible log-analysis framework extracts end-to-end causal flows from common white-box (i.e., application) logs in the production system; these end-to-end flows capture the user's experience with the system. Second, anomaly-detection tools exploit heuristics and a peer-comparison approach to label each end-to-end flow as successful or failed. Third, a top-down statistical diagnostic tool combines white-box metrics with black-box metrics (e.g., CPU usage) to localize the source of the problem by identifying attributes that are more correlated with failed flows than with successful ones. Fourth, a visualization tool uses peer comparison to highlight anomalous nodes in a parallel-computing cluster. The diagnostic framework has been used to localize real incidents at an academic cloud-computing cluster that runs the Hadoop parallel-processing framework, and at a production Voice-over-IP system at a major Internet Service Provider. Our approach is not limited to these two systems; it is applicable to systems, such as Internet services, that serve users via independent interactions.
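The third component's idea of ranking attributes by how strongly they are associated with failed flows can be illustrated with a minimal sketch. The abstract does not specify the statistical method used in the dissertation, so the sketch below assumes a simple stand-in: the difference between the fraction of failed flows and the fraction of successful flows that carry each attribute. The function name `rank_attributes`, the flow representation, and the attribute strings (`server=s1`, `cpu=high`, and so on) are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def rank_attributes(flows):
    """Rank attributes by (share of failed flows) - (share of successful flows).

    `flows` is a list of (attributes, failed) pairs, where `attributes` is a
    set of strings describing one end-to-end flow and `failed` is a bool.
    This scoring rule is an illustrative assumption, not the dissertation's
    actual statistical method.
    """
    failed = [attrs for attrs, is_failed in flows if is_failed]
    succeeded = [attrs for attrs, is_failed in flows if not is_failed]

    # Count how many failed and successful flows carry each attribute.
    failed_counts = Counter(a for attrs in failed for a in attrs)
    ok_counts = Counter(a for attrs in succeeded for a in attrs)

    scores = {}
    for attr in set(failed_counts) | set(ok_counts):
        p_fail = failed_counts[attr] / len(failed) if failed else 0.0
        p_ok = ok_counts[attr] / len(succeeded) if succeeded else 0.0
        scores[attr] = p_fail - p_ok

    # Highest score first: attributes seen mostly in failed flows.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical flows mixing a white-box attribute (server) with a
# black-box one (CPU usage level).
flows = [
    ({"server=s1", "cpu=high"}, True),
    ({"server=s1", "cpu=low"}, True),
    ({"server=s2", "cpu=low"}, False),
    ({"server=s3", "cpu=low"}, False),
    ({"server=s2", "cpu=high"}, False),
]

ranked = rank_attributes(flows)
for attr, score in ranked:
    print(f"{attr}: {score:+.2f}")
```

In this toy data every failed flow touches `server=s1` and no successful flow does, so that attribute ranks first, while high CPU usage appears in both groups and scores close to zero. A real deployment would need a measure that accounts for sample size and for overlapping problems, which this sketch does not attempt.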