Thursday, February 8, 2007
In this talk, I give two examples of how statistical machine learning algorithms, combined with appropriate instrumentation, can aid in failure diagnosis. The first example is an automatic software debugger that collects information from past successful and failed runs to locate suspicious program predicates. The data is obtained via fine-grained instrumentation of the program. We demonstrate, on several real-world programs, a bi-clustering algorithm that simultaneously clusters failed runs and selects useful predicates.
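To make the idea of a "suspicious predicate" concrete, here is a minimal sketch in Python. It is not the bi-clustering algorithm presented in the talk; it swaps in a much simpler heuristic that scores each predicate by how much more often it is observed true in failed runs than in successful ones. The function name `rank_predicates`, the input layout, and the example predicates are all hypothetical.

```python
from collections import defaultdict


def rank_predicates(runs):
    """Rank predicates by (rate true in failed runs) - (rate true in passing runs).

    `runs` is a list of (true_predicates, failed) pairs, where
    `true_predicates` is an iterable of predicate names observed true in
    that run and `failed` is a bool. This layout is an assumption made
    for illustration, not the instrumentation format used in the talk.
    """
    fail_true = defaultdict(int)
    pass_true = defaultdict(int)
    n_fail = sum(1 for _, failed in runs if failed)
    n_pass = len(runs) - n_fail

    # Count, per predicate, how many failed and passing runs saw it true.
    for preds, failed in runs:
        for p in set(preds):
            if failed:
                fail_true[p] += 1
            else:
                pass_true[p] += 1

    scores = {}
    for p in set(fail_true) | set(pass_true):
        fail_rate = fail_true[p] / n_fail if n_fail else 0.0
        pass_rate = pass_true[p] / n_pass if n_pass else 0.0
        scores[p] = fail_rate - pass_rate

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical observations from four instrumented runs.
example_runs = [
    ({"ptr == NULL", "i > 0"}, True),
    ({"ptr == NULL"}, True),
    ({"i > 0"}, False),
    ({"i > 0", "len < 10"}, False),
]
ranking = rank_predicates(example_runs)
```

A per-predicate score like this treats all failures as one population. Clustering failed runs at the same time as selecting predicates, as the talk describes, is aimed at the case where a program has several distinct bugs, each with its own indicative predicates.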
The second example comes from performance diagnosis in a distributed file system. We obtain snapshots of the system that contain coarse-grained traces of each file access request. We show that standard clustering techniques can separate requests into meaningful categories and pinpoint the key differences between snapshots.
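As a rough illustration of clustering requests and comparing snapshots, the sketch below runs a plain k-means over per-request feature vectors and then reports what fraction of a snapshot's requests falls into each cluster. The abstract does not say which clustering technique or which features were used, so k-means, the `(latency_ms, size_kb)` features, and the function names `kmeans` and `cluster_shares` are assumptions chosen for illustration.

```python
def kmeans(points, k, iters=20):
    """Plain k-means on equal-length numeric tuples.

    Centroids are initialised from the first k points so the result is
    deterministic; this is a simplification for illustration only.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def cluster_shares(labels, k):
    """Fraction of requests assigned to each of the k clusters."""
    return [labels.count(c) / len(labels) for c in range(k)]


# Hypothetical (latency_ms, size_kb) features for five requests in one snapshot.
requests = [(1.0, 0.1), (50.0, 5.0), (1.2, 0.1), (48.0, 5.5), (0.9, 0.2)]
labels, centroids = kmeans(requests, k=2)
shares = cluster_shares(labels, k=2)
```

Under these assumptions, comparing `cluster_shares` for two snapshots labelled against the same centroids would show which category of request grew or shrank between them, which is one simple way to surface differences between snapshots.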
Work on the software debugger was done in collaboration with Ben Liblit (U. Wisconsin, Madison), Michael Jordan (U.C. Berkeley), and Alex Aiken and Mayur Naik (Stanford). Work on performance diagnosis is a collaborative effort with Raja Sambasivan and Greg Ganger (CMU).