DATE: Thursday, November 4, 2004
TIME: Noon - 1 pm
PLACE: Hamerschlag Hall D-210

Armando Fox
Stanford University

Recovery as Rapid Adaptation:
Combining Fast Microrecovery with Statistical Monitoring

We began the Recovery-Oriented Computing (ROC) project with the goal of increasing Internet server availability by reducing time to recovery. Building on the observation that rebooting or restarting is a well-known and simple form of recovery that returns systems or subsystems to a "clean slate", we proposed designing systems so that the only shutdown method is crashing and the only recovery method is fast reboot; we called this approach crash-only software. Having designed three crash-only systems, we find that cheap recovery, while indeed good for its own sake in improving availability, also enables "micro-recovery" as a first line of defense: rather than performing complex error unwinding, coerce any observed error to a (micro-)crash, then (micro-)recover. If micro-recovery is sufficiently cheap in performance and does not impact correctness, there is no reason to avoid trying it first, even if it does not always solve the problem. This in turn enables the use of aggressive automated detection techniques that have nontrivial false positive rates or, equivalently, the deployment of multiple overlapping detectors/alarms in order to be conservative. Fast, cheap micro-recovery also allows more liberal use of rejuvenation, such as so-called "rolling reboots", without worrying about the "best" time to do it. We have also found that cheap recovery allows some maintenance operations, such as incremental scaling of storage, to be recast as failure plus recovery, exploiting the same mechanisms to achieve online scaling without service interruption.

In this talk I'll describe highlights and design lessons from three crash-only systems we've built, including experiments using statistical anomaly detection techniques (with nontrivial false positive rates) as a complementary monitoring strategy. I'll also discuss how this approach might provide a scientific basis for designing tolerant applications in the face of imperfect detection and localization techniques.

More at and

Armando Fox has been an Assistant Professor at Stanford since January 1999. He has focused on improving system dependability through fast recovery, and was listed among the "Scientific American 50" of 2003 for his work in that area. Prof. Fox has also received teaching awards from the Associated Students of Stanford University, Tau Beta Pi, and the Society of Women Engineers. His other degrees in EECS are from MIT and the University of Illinois.

Host: David Garlan

For Further Seminar Info Contact:
or visit