Speaker: Elizabeth Borowsky, Storage System Program, HP Labs
Date: March 3, 1999
Architecting for earthquakes: fault tolerant storage
In today's information-centric global marketplace, business critical data must be available all the time. Rain or shine, earthquake in California or tornado in Kansas, the data center needs to be on-line meeting quality of service guarantees twenty-four hours a day, seven days a week, with no exceptions. This dire need for continual high performance and high availability is best met by an active system that can monitor, diagnose, and repair itself on the fly. Key design features are on-line load balancing, fluid scalability, and automatic recovery from failure.
In the Palladio project at HP Labs we are designing a fault tolerant distributed storage system to meet these goals. The advent of high speed back-end storage networks enables distributed storage to be accessed as quickly as a local disk. These SANs (storage area networks) facilitate access to large pools of heterogeneous storage by multiple hosts. In order to achieve automatic load balancing and fault tolerance, our approach is to hide the details of the underlying data placement from the hosts. The upper layers of the system see only a virtual store abstraction of the data. Maintaining this abstraction, while still providing quality of service guarantees and data coherency, is not trivial. This is especially true in the face of network partitions, device crashes, and failures in the storage management system. In this talk I will discuss the architectural choices we've made in designing the storage management system, and present in detail our solution to disaster recovery. I will prove the liveness and safety properties of the recovery protocol. Namely, I will show that the system eventually recovers from the failure (when recovery is at all possible) and that data coherency is guaranteed throughout.
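The virtual store idea can be made concrete with a small sketch. The following is a hypothetical illustration, not Palladio's actual design: hosts address logical blocks, a placement map translates them to (device, physical block) locations, and blocks can be migrated between devices for load balancing without the host's view changing.

```python
# Hypothetical sketch of a virtual store abstraction (all names and the
# naive placement policy are illustrative, not taken from Palladio).

class VirtualStore:
    """Hosts read/write logical blocks; physical placement stays hidden."""

    def __init__(self):
        self.placement = {}  # logical block -> (device name, physical block)
        self.devices = {}    # device name -> {physical block: data}

    def add_device(self, name):
        self.devices[name] = {}

    def write(self, lba, data):
        # Naive placement: a new block goes to the first registered device.
        if lba not in self.placement:
            device = next(iter(self.devices))
            self.placement[lba] = (device, lba)
        device, pba = self.placement[lba]
        self.devices[device][pba] = data

    def read(self, lba):
        device, pba = self.placement[lba]
        return self.devices[device][pba]

    def migrate(self, lba, target_device):
        # Move a block for load balancing; the host's logical view is unchanged.
        data = self.read(lba)
        old_device, old_pba = self.placement[lba]
        self.devices[target_device][lba] = data
        del self.devices[old_device][old_pba]
        self.placement[lba] = (target_device, lba)
```

The hard part, as the abstract notes, is keeping this map coherent when devices crash or the network partitions mid-migration; the sketch above deliberately ignores that.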
Elizabeth Borowsky is a member of the Storage Systems program at HP Labs in Palo Alto, California. She spends her time figuring out how to automatically (and optimally) configure storage, and how to design distributed storage management systems that can withstand arbitrary failures. Prior to joining HP in 1995, Liz was at UCLA completing her PhD in computer science. In her thesis she gave a characterization of which tasks are solvable (or unsolvable) in asynchronous, possibly faulty, shared memory distributed systems.