Design and Evaluation of Highly Parallel, Reliable, and Available Disk Arrays

Contacts: Prof. G.A. Gibson, Prof. D.P. Siewiorek, Mark Holland, Daniel Stodolsky, Bill Courtright

Goals

This project merges last year's two new projects, "Highly Available Disk Arrays" and "Architectures and Interfaces for Very Small Disks," with one of this thrust area's original projects, "File Servers on High-Speed Local-Area Networks," and with related efforts arising from Prof. Gibson's collaboration with the University of California at Berkeley and from the School of Computer Science projects spawned from this collaboration. The common theme of all these projects is their emphasis on disk arrays as the paradigm for storage in modern computing systems. The resulting DSSC project better represents the multifaceted disk array research being carried out in or in cooperation with this thrust area.

The broad goal of this project is to advance the state of the art of redundant disk arrays. Redundant disk arrays are emerging as this decade's dominant architecture for high performance, high reliability, cost effective secondary storage. Non-redundant disk arrays provide the cost, volume, and capacity of traditional disk subsystems and, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant disk arrays use simple redundancy schemes to provide high data reliability (a brief parity sketch follows the goal lists below).

More specifically, this project has the following knowledge-based goals:
o an understanding of performance, reliability, and availability in redundant disk arrays,
o design strategies and reconstruction algorithms for high availability,
o identification of and strategies for overcoming bottlenecks in highly parallel, high performance redundant disk arrays,
o identification of and strategies for overcoming bottlenecks in the interface between high performance storage and high performance local-area networks,
o experience with disk arrays that provide storage for I/O-intensive massively parallel applications, and
o an understanding of the implications for massively parallel disk arrays of the trend in magnetic disks to very small diameter disks;

and the following technology-based goals:
o measurement and modeling for performance, reliability, and availability in redundant disk arrays,
o implementations of redundant disk array schemes including RAID levels 0, 1, 3, and 5, parity logging, and declustered parity,
o implementations of alternative reconstruction algorithms,
o Nectar-attached disk arrays, each delivering 10 MB/sec, collectively providing storage for Nectar-based massively parallel applications, and
o an evaluated architecture for massive disk arrays based on very small diameter magnetic disks.
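The sketch below is a minimal illustration of the block-interleaved parity scheme underlying the RAID level 5 organizations discussed throughout this report: each stripe carries one parity block, the bytewise XOR of its data blocks, and parity placement rotates across the disks. The array width, block size, and particular rotation are assumptions chosen for illustration, not the project's code.

    /*
     * Minimal sketch (illustrative only): block-interleaved parity as in a
     * RAID level 5 array.  Each stripe of NDISKS-1 data blocks carries one
     * parity block, the bytewise XOR of the data blocks; parity placement
     * rotates across disks so no single disk becomes a parity hot spot.
     * Array width and block size below are assumed values.
     */
    #include <stdio.h>
    #include <string.h>

    #define NDISKS      5           /* assumed array width  */
    #define BLOCK_BYTES 512         /* assumed block size   */

    /* XOR the data blocks of one stripe into a parity block. */
    static void compute_parity(unsigned char data[][BLOCK_BYTES],
                               int ndata, unsigned char *parity)
    {
        memset(parity, 0, BLOCK_BYTES);
        for (int d = 0; d < ndata; d++)
            for (int b = 0; b < BLOCK_BYTES; b++)
                parity[b] ^= data[d][b];
    }

    /* One possible rotation: stripe s keeps its parity on disk
     * (NDISKS - 1 - s % NDISKS); the remaining disks hold data. */
    static int parity_disk(int stripe)
    {
        return NDISKS - 1 - (stripe % NDISKS);
    }

    int main(void)
    {
        unsigned char data[NDISKS - 1][BLOCK_BYTES];
        unsigned char parity[BLOCK_BYTES];

        memset(data, 0xA5, sizeof data);        /* dummy contents */
        compute_parity(data, NDISKS - 1, parity);

        for (int s = 0; s < NDISKS; s++)
            printf("stripe %d: parity on disk %d\n", s, parity_disk(s));
        return 0;
    }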
Accomplishments

Berkeley RAID

Before April 1991, Gibson was wholly occupied at Berkeley with dissertation research and with contributing to the design of Berkeley's prototype network-attached massive disk array, RAID-II. After arriving at Carnegie Mellon and beginning work with the DSSC, Gibson led a year-long (July 1991 through June 1992) Input/Output Systems Seminar based on his dissertation research. This seminar series energized computer science student interest in storage systems and directly led to the parity logging idea described below.

Since coming to CMU, Gibson has continued to cooperate with Berkeley, publishing his dissertation with MIT Press [Gibson91], publishing a paper on designing disk arrays for high reliability [Gibson93], contributing to a submission on the architecture of high performance disk arrays [Lee93], and contributing to a survey article in preparation.

Highly Available Disk Arrays

Some applications, notably on-line transaction processing (OLTP), require storage systems that offer high throughput for many parallel small random accesses, low-cost storage, high data reliability, and high data availability. Standard redundant disk array organizations do not provide sufficiently high availability because servicing requests in an array with a failed disk doubles the load applied to the surviving disks, and servicing requests during a subsequent disk recovery more than doubles that load. In this case, either user access to data is greatly reduced for minutes (at least) or recovery must be done so slowly that data reliability is greatly reduced.

In this research, we have focused on balancing cost against data reliability and performance during failure recovery in highly available, parity-based arrays for use in continuous-operation systems. We have employed a parity encoding scheme called "parity declustering" that improves on standard parity organizations by reducing the additional load on surviving disks during the reconstruction of a failed disk's contents. This yields higher user throughput during recovery, shorter recovery time, or both. Working in a detailed simulation environment [Lee91], we have developed a flexible, efficient implementation of declustered parity based on balanced incomplete and complete block designs [Holland92]; a simplified layout sketch appears at the end of this subsection. We then evaluated this implementation under a highly concurrent workload composed of small user accesses. We show that declustered parity penalizes user response time while a disk is being repaired (before and during its recovery) less than comparable non-declustered (RAID level 5) organizations, without any penalty to user response time in the fault-free state. In particular, we show how to trade increased redundancy (cost) for decreased user response time penalties during recovery. Further, we have examined previously proposed optimization algorithms [Muntz90] for declustered parity disk arrays and found that, contrary to previous suggestions, these algorithms may actually penalize performance during recovery in many arrays.

The results of this declustered parity implementation and evaluation effort led us to pursue recovery algorithm mechanisms for further controlling the tradeoff between on-line performance during recovery and data reliability. We have evaluated the differences between recovery based on one or more threads reconstructing one "stripe" of disk array data at a time and recovery based on one or more threads collecting data from, or delivering reconstructed data to, individual disks. The latter "disk-oriented" algorithm requires a shared buffer space and distributed reconstruction, but it yields substantially better utilization of array disks during reconstruction [Holland93].
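The sketch below illustrates the layout idea behind declustered parity for the simplest (complete block design) case; it is our own illustration, not the [Holland92] implementation. With an assumed array of 7 disks and a parity stripe width of 4, every 4-disk subset of the array holds one parity stripe. Because each pair of disks appears together in the same number of stripes, the reads needed to reconstruct a failed disk are spread evenly over the survivors, and each survivor absorbs only a fraction of the full-stripe (RAID level 5) reconstruction load.

    /*
     * Illustrative sketch only: declustered parity laid out with a complete
     * block design.  With C disks and a parity stripe width of G < C, every
     * G-disk subset of the array holds one parity stripe.  The program
     * counts, for an assumed failed disk, how many stripe units each
     * surviving disk must read during reconstruction; the counts come out
     * equal, showing the evenly spread recovery load.  C and G are assumed.
     */
    #include <stdio.h>

    #define C 7                     /* disks in the array (assumed)      */
    #define G 4                     /* disks per parity stripe (assumed) */

    static int reads[C];            /* reconstruction reads per survivor */

    /* Recursively enumerate every G-disk subset; each subset is one parity
     * stripe.  When a stripe contains the failed disk, each of its other
     * G-1 members must be read to rebuild the lost unit. */
    static void enumerate(int next, int chosen[], int depth, int failed)
    {
        if (depth == G) {
            int hit = 0;
            for (int i = 0; i < G; i++)
                if (chosen[i] == failed)
                    hit = 1;
            if (hit)
                for (int i = 0; i < G; i++)
                    if (chosen[i] != failed)
                        reads[chosen[i]]++;
            return;
        }
        for (int d = next; d < C; d++) {
            chosen[depth] = d;
            enumerate(d + 1, chosen, depth + 1, failed);
        }
    }

    int main(void)
    {
        int chosen[G];
        int failed = 0;             /* say disk 0 fails */

        enumerate(0, chosen, 0, failed);
        for (int d = 1; d < C; d++)
            printf("disk %d reads %d stripe units during reconstruction\n",
                   d, reads[d]);
        return 0;
    }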
Parity Logging: Overcoming the Small Write Bottleneck

This research addresses another weakness of standard redundant disk arrays under workloads, such as OLTP, that feature a large number of small random writes. The problem is that each time data is changed, the corresponding parity encoding must also be updated. If the update is large enough and appropriately aligned so that all data associated with an affected parity block is changed, then the new parity can be computed in memory; in this case the parity update overhead is just the additional write of the new parity. If the update is small, however, a data update quite often requires that the old data be read from disk before it is overwritten, the changed bits be computed, the old parity be read, the corresponding parity bits be changed, and then the new parity be written. A write of a single data block thus causes four disk blocks to be accessed: two read and two written. This cost has been called the "small write problem" or the "small write bottleneck."

The most powerful solution to this problem is the "log-structured filesystem" [Rosenblum91]. It works by dynamically remapping the disk storage location of each user's data so that all recently written data is located together. Then, by delaying the actual write accesses, enough data can be collected that all writes are large enough to be efficient. However, this approach requires extensive changes in the operating system, and it can lead to very poor performance if the read access patterns differ from the write access patterns. In particular, if data is written in small random blocks but read in large sequential blocks (as in report generation), then any smart disk allocation provided by the user (or data manager) is ineffective and low performance may result.

We have developed a novel solution that preserves the user's disk allocation decisions while reducing the cost of a data write to slightly more than two disk accesses, or even as low as slightly more than one disk access [Stodolsky93]. Our solution is based on delaying only the application of changes to parity blocks: these changes are logged in large blocks to "log space," then applied much later, when enough work has accumulated for efficient group updates (a brief sketch of this write path follows this subsection). We have developed an analytic model for comparing our approach, called "parity logging," to the standard RAID level 5, traditional mirroring, the proposed floating parity scheme [Menon92], and non-redundant approaches; we plan to extend this work with a direct comparison to the log-structured filesystem approach. For a strawman disk array, in the best case, our scheme achieves 36 small random writes per second per disk, while RAID level 5 delivers only 18, floating parity delivers 23, and mirroring delivers 25 (for comparison, these disks can do up to 50 I/Os per second if no redundancy is maintained). We have also implemented these redundancy schemes in the simulation platform used by the highly available disk arrays research and found that our analytic model's error is less than 5%. An additional benefit of the parity logging approach is that when higher levels of data protection are employed (tolerating all double or triple disk failures), its performance overhead grows much more slowly than that of the other schemes. This work has the potential to be very important to the disk array marketplace.
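The following sketch illustrates the parity logging write path described above. It is our own illustration, not the [Stodolsky93] implementation: the block and log region sizes are arbitrary, and the disk accesses are stubs. A small write still reads the old data and writes the new data, but instead of the read-modify-write of the parity block, the parity update record (the XOR of old and new data) is appended to an in-memory region that is flushed to log space in large, efficient transfers and applied to parity much later in batched passes.

    /*
     * Minimal sketch of the parity logging write path (illustrative only).
     * Block size, region size, and the disk I/O stubs are assumptions.
     */
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_BYTES   512
    #define REGION_BLOCKS 64        /* log region flushed as one large write */

    struct update {                 /* one parity update record              */
        unsigned long parity_block;             /* parity block address      */
        unsigned char delta[BLOCK_BYTES];       /* old data XOR new data     */
    };

    static struct update region[REGION_BLOCKS];
    static int region_fill;

    /* Stubs standing in for real disk accesses in this sketch. */
    static void read_block(unsigned long blk, unsigned char *buf)
    {
        (void)blk; memset(buf, 0, BLOCK_BYTES);
    }
    static void write_block(unsigned long blk, const unsigned char *buf)
    {
        (void)blk; (void)buf;
    }
    static void append_to_log_space(const struct update *r, int n)
    {
        (void)r;
        printf("flushing %d parity update records in one large write\n", n);
    }

    /* Small write under parity logging: one read (old data), one write
     * (new data), and a deferred, amortized parity update via the log. */
    static void small_write(unsigned long data_blk, unsigned long parity_blk,
                            const unsigned char *new_data)
    {
        unsigned char old_data[BLOCK_BYTES];
        struct update *rec = &region[region_fill];

        read_block(data_blk, old_data);                 /* 1st access      */
        write_block(data_blk, new_data);                /* 2nd access      */

        rec->parity_block = parity_blk;
        for (int b = 0; b < BLOCK_BYTES; b++)
            rec->delta[b] = old_data[b] ^ new_data[b];  /* log, don't write */

        if (++region_fill == REGION_BLOCKS) {           /* amortized cost  */
            append_to_log_space(region, region_fill);
            region_fill = 0;
        }
    }

    int main(void)
    {
        unsigned char buf[BLOCK_BYTES];
        memset(buf, 0x5A, sizeof buf);
        for (int i = 0; i < 200; i++)
            small_write((unsigned long)i, 1000 + i % 16, buf);
        return 0;
    }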
Nectar-attached Disk Array

This effort is derived from one of the longest standing projects in this thrust area. We have always recognized the need for an efficient, effective interface between high performance storage systems and the high performance networks that deliver data between storage servers and client machines. During 1990 and 1991 the FDDI-based, 100 Mbit/sec prototype Nectar network [Arnould89, Kung91] was used to explore the bottlenecks in a standard, workstation-based UNIX file server delivering data to a remote machine. We found that a large block size was needed to drive a single disk at greater than 50% utilization; with 32 KB blocks we achieved 95% utilization for large image accesses. When we applied similar tests over the network, prototype Nectar did not slow data transfers, while Ethernet limited disk utilization to under 30%. We also identified data copying in the server as a major source of performance loss.

The next phase of our research is to attach disk arrays to nodes in the faster, HIPPI-based Gigabit Nectar implementation. Gigabit Nectar is expected to be ready for workstation use by the summer of 1993. Because we intend to use commercial disk arrays for this platform, and because such products are improving rapidly, we are delaying acquisition and will put our disk array in place at about the same time as Gigabit Nectar becomes available.

Architectures and Interfaces for Very Small Disks

This effort focuses on the challenges of exploiting magnetic disks of rapidly decreasing diameter as components in massively parallel disk arrays for ever faster supercomputers. This portion of our disk array studies has not yet come up to speed; we have yet to identify a student for this work. Consultations between our disk array systems specialists and magnetic disk technologists are underway. We expect to make progress on this project in 1993 and hope to have a student in place by the end of the year.

Plans

We are proceeding in this project in many directions at once. Our cooperation with Berkeley continues, with emphasis on understanding the design and performance of their RAID-II disk array controller prototype. Our effort in highly available disk arrays is focusing on recovery algorithms that shift their reconstruction focus to track intervening user requests. We will also be looking at the implications of parity logging for recovery performance and data reliability. The parity logging effort will begin by broadening the range of workloads to which it can be profitably applied and by extending its modeling and simulation to understand its advantages and disadvantages relative to log-structured filesystems. As Gigabit Nectar comes on-line, we will put a great deal of effort into Nectar-attached storage systems. We will use commercial products as much as possible, except for software control, where we hope to implement our recovery algorithms, our parity logging redundancy control scheme, and a prefetching file system design. We will also return to an examination of file system and operating system overhead for much faster storage servers on much faster networks. Finally, much work remains to be done in our small disks for massive disk arrays effort; primarily, we need to find a student for this project who will examine the interfaces and architectures for these small diameter disks.