PDL Abstract

A Fault Model for Upgrades in Distributed Systems

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-115, December 2008.

Tudor Dumitraş, Soila Kavulya, Priya Narasimhan

Parallel Data Laboratory
School of Computer Science & Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213


Recent studies, and a large body of anecdotal evidence, suggest that upgrades are unreliable and often end in failure, causing downtime and data-loss. While this is sometimes due to software defects in the new version, most upgradefailures are the result of faults in the upgrade procedure, such as broken dependencies. In this paper, we present data on upgrade failures from three independent sources — a user study, a survey and a field study — and, through statistical cluster analysis, we construct a novel fault model for upgrades in distributed systems. We identify four distinct types of faults: (1) simple configuration errors (e.g., typos); (2) semantic configuration errors (e.g., misunderstood effects of parameters); (3) broken environmental dependencies (e.g., incorrect libraries, port conflicts); and (4) complex procedural errors. We estimate that, on average, Type 1 faults occur in 15.2 % of upgrades, and Type 4 faults occur in 16.8 % of upgrades.

FULL TR: pdf