ABSTRACT

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-103, May 2008.
RAMS and BlackSheep: Inferring White-box Application Behavior Using Black-box Techniques
Submitted in partial fulfillment of the requirements for the Senior Honors Thesis
program in the School of Computer Science at Carnegie Mellon University
Jiaqi Tan, Priya Narasimhan
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
http://www.pdl.cmu.edu/
A significant challenge in developing automated problem-diagnosis tools for distributed systems is enabling these tools to
distinguish changes in system behavior caused by workload changes from those caused by faults. To address this challenge,
current, typically white-box, techniques extract semantically-rich knowledge about the target application through fairly invasive,
high-overhead instrumentation. We propose and explore two scalable, low-overhead, non-invasive techniques to infer semantics
about target distributed systems, in a black-box manner, to facilitate problem diagnosis. RAMS applies statistical analysis to
hardware performance counters to predict whether a given node in a distributed system is faulty, while BlackSheep corroborates
multiple system metrics with application-level logs to determine whether a given node is faulty. In addition, we have developed
and demonstrated a novel technique to extract, from existing application-level logs, semantically-rich behavior that is immediately
amenable to analysis and synthesis with other numerical, black-box metrics. We have evaluated the efficacy of RAMS and
BlackSheep in diagnosing real-world problems in the Hadoop distributed parallel programming system.
KEYWORDS: problem diagnosis, log analysis, distributed systems
FULL TR: pdf


© 2006. Last updated 14 May, 2008