Practical Experiences with Chronics Discovery in Large Telecommunications Systems
SLAML 2011, October 23, 2011, Cascais, Portugal.
Soila P. Kavulya*, Kaustubh Joshi^, Matti Hiltunen^, Scott Daniels^,
*Carnegie Mellon University
^AT&T Labs - Research
Chronics are recurrent problems that fly under the radar of operations teams because they do not perturb the system enough to set off alarms or violate service-level objectives. Discovering and diagnosing never-before-seen chronics poses new challenges: traditional threshold-based techniques do not detect them, and many chronics can be present in a system at once, all starting and ending at different times. In this paper, we describe our experiences diagnosing chronics using server logs from a large telecommunications service. Our technique couples a scalable Bayesian distribution learner with an information-theoretic measure of distance (KL divergence) to identify the attributes that best distinguish failed calls from successful calls. Our preliminary results demonstrate the usefulness of the technique through actual instances in which we helped operators discover and diagnose chronics.
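The ranking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a simple Dirichlet-smoothed categorical estimate stands in for the Bayesian distribution learner, and the attribute names (`server`, `codec`), the call records, and every function name are hypothetical. For each attribute, the sketch estimates the distribution of its values among failed calls and among successful calls, then scores the attribute by the KL divergence between the two.

```python
import math
from collections import Counter

def smoothed_distribution(values, support, alpha=1.0):
    """Posterior-mean estimate of a categorical distribution under a
    symmetric Dirichlet(alpha) prior (an assumed stand-in estimator)."""
    counts = Counter(values)
    total = len(values) + alpha * len(support)
    return {v: (counts[v] + alpha) / total for v in support}

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same support.
    Smoothing keeps every probability strictly positive."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)

def rank_attributes(calls):
    """Rank attributes by how strongly their value distribution differs
    between failed and successful calls.

    calls: list of (attributes: dict, failed: bool) pairs.
    Returns (attribute, score) pairs, highest divergence first.
    """
    failed = [attrs for attrs, is_failed in calls if is_failed]
    succeeded = [attrs for attrs, is_failed in calls if not is_failed]
    names = sorted({name for attrs, _ in calls for name in attrs})

    scores = {}
    for name in names:
        failed_values = [attrs.get(name, "<missing>") for attrs in failed]
        ok_values = [attrs.get(name, "<missing>") for attrs in succeeded]
        support = set(failed_values) | set(ok_values)
        p = smoothed_distribution(failed_values, support)
        q = smoothed_distribution(ok_values, support)
        scores[name] = kl_divergence(p, q)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical call records: every failure passes through server "s3",
# while the codec is evenly split in both groups.
calls = [
    ({"server": "s1", "codec": "g711"}, False),
    ({"server": "s2", "codec": "g729"}, False),
    ({"server": "s1", "codec": "g729"}, False),
    ({"server": "s2", "codec": "g711"}, False),
    ({"server": "s3", "codec": "g711"}, True),
    ({"server": "s3", "codec": "g729"}, True),
]

ranked = rank_attributes(calls)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy data the `server` attribute ranks first because its values differ sharply between the two groups, while `codec` scores zero because it is distributed identically in both. A real deployment would also have to handle many overlapping chronics and time windows, which this sketch ignores.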