SDI Seminar

Speaker: Peter M. Chen, University of Michigan

Date: February 12, 1998

Transparent, Low-Overhead Recovery for Distributed Applications


Over the past few years, the Rio project at Michigan has developed reliable main memory (the Rio file cache) and a fast transaction library (Vista). Reliable memory and free transactions provide new and interesting resources to use in building systems. In this talk, I describe how to use these resources to build recoverable, distributed applications.

Reliable main memory enables a simple model for distributed computing, which we call Messages in Local Transactions (MLT). A local transaction in the MLT model may update memory, send messages, and receive messages. By performing a set of these operations atomically and durably, the transaction mechanism keeps the message state and the local state consistent. Like other recovery schemes, MLT ensures that the system always recovers from process failures to a globally consistent state. However, MLT suffers from none of the drawbacks of other recovery schemes: surviving processes are not involved in recovery (i.e., they never roll back); processes do not coordinate or send extra messages during normal operation; and applications can be non-deterministic. We have implemented MLT as an extension to Vista (Vistagrams), and it adds negligible overhead to an existing protocol.
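To make the MLT idea concrete, here is a minimal sketch in Python. It is not the Vista or Vistagrams API; the names Process, LocalTransaction, and demo are hypothetical, and ordinary Python objects stand in for reliable memory. The sketch assumes that a process's state, its queue of unconsumed incoming messages, and its queue of outgoing messages all live in reliable memory, so a transaction can buffer its memory updates, receives, and sends and apply them together at commit.

```python
import copy


class Process:
    """Toy process; its fields stand in for data kept in reliable memory."""

    def __init__(self, name):
        self.name = name
        self.state = {}   # local state
        self.inbox = []   # delivered but not yet consumed messages
        self.outbox = []  # messages committed for sending


class LocalTransaction:
    """Hypothetical local transaction: updates, receives, and sends
    take effect together at commit, or not at all."""

    def __init__(self, proc):
        self.proc = proc
        self._state = copy.deepcopy(proc.state)  # private working copy
        self._consumed = 0                       # inbox messages read so far
        self._sent = []                          # sends buffered until commit

    def update(self, key, value):
        self._state[key] = value

    def receive(self):
        if self._consumed >= len(self.proc.inbox):
            raise LookupError("no message available")
        msg = self.proc.inbox[self._consumed]
        self._consumed += 1
        return msg

    def send(self, dest, payload):
        self._sent.append((dest, payload))

    def commit(self):
        # In a real system these three steps would be one atomic, durable
        # action; in this single-threaded sketch they simply run together.
        p = self.proc
        p.state = self._state
        del p.inbox[:self._consumed]
        p.outbox.extend(self._sent)


def demo():
    p = Process("server")
    p.state["count"] = 0
    p.inbox.append("incr")

    # First attempt: the process "crashes" before commit, so nothing sticks.
    t = LocalTransaction(p)
    t.receive()
    t.update("count", 1)
    t.send("client", "ack")
    after_crash = (dict(p.state), list(p.inbox), list(p.outbox))

    # After recovery the message is still in the inbox and is handled again.
    t = LocalTransaction(p)
    t.receive()
    t.update("count", p.state["count"] + 1)
    t.send("client", "ack")
    t.commit()
    after_commit = (dict(p.state), list(p.inbox), list(p.outbox))
    return after_crash, after_commit
```

In this sketch, the aborted attempt leaves the state, inbox, and outbox untouched, so no other process ever sees a message that the recovered state does not account for. That is the intuition behind surviving processes not needing to roll back; how Vistagrams actually achieves this is not described here.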

I also describe a checkpointing library (Free Checking) built on top of Vista and Vistagrams. Free Checking maps the entire process state (address space and registers) into Vista. By linking with the Free Checking library, an application becomes recoverable with few source code changes and little run-time overhead.

Bio: Peter M. Chen received a B.S. in Electrical Engineering from the Pennsylvania State University in 1987 and an M.S. and a Ph.D. in Computer Science from the University of California at Berkeley in 1989 and 1992, respectively.

He is currently an Assistant Professor in the Department of Electrical Engineering and Computer Science at the University of Michigan at Ann Arbor. His research interests include operating systems, databases, and distributed systems, with a focus on improving the performance and reliability of computer storage systems.