Design and Evaluation of Highly Parallel, Reliable, and Available Disk Arrays

Contacts: Prof. G.A. Gibson, Prof. D.P. Siewiorek, Mark Holland, Daniel Stodolsky, Bill Courtright

Goals

This project merges last year's two new projects, "Highly Available Disk Arrays" and "Architectures and Interfaces for Very Small Disks," with one of this thrust area's original projects, "File Servers on High-Speed Local-Area Networks," and with related efforts arising from Prof. Gibson's collaboration with the University of California at Berkeley and from the School of Computer Science projects spawned from this collaboration. The common theme of all these projects is their emphasis on disk arrays as the paradigm for storage in modern computing systems. The resulting DSSC project better represents the multifaceted disk array research being carried out in or in cooperation with this thrust area.

The broad goal of this project is to advance the state of the art of redundant disk arrays. Redundant disk arrays are emerging as this decade's dominant architecture for high performance, high reliability, cost effective secondary storage. Non-redundant disk arrays provide the cost, volume, and capacity of traditional disk subsystems and, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant disk arrays use simple redundancy schemes to provide high data reliability (a brief parity sketch follows the goal lists below).

More specifically, this project has the following knowledge-based goals:
o an understanding of performance, reliability, and availability in redundant disk arrays,
o design strategies and reconstruction algorithms for high availability,
o identification of and strategies for overcoming bottlenecks in highly parallel, high performance redundant disk arrays,
o identification of and strategies for overcoming bottlenecks in the interface between high performance storage and high performance local-area networks,
o experience with disk arrays that provide storage for I/O-intensive massively parallel applications, and
o an understanding of the implications for massively parallel disk arrays of the trend in magnetic disks to very small diameter disks;

and the following technology-based goals:
o measurement and modeling for performance, reliability, and availability in redundant disk arrays,
o implementations of redundant disk array schemes including RAID levels 0, 1, 3, and 5, parity logging, and declustered parity,
o implementations of alternative reconstruction algorithms,
o Nectar-attached disk arrays, each delivering 10 MB/sec, collectively providing storage for Nectar-based massively parallel applications, and
o an evaluated architecture for massive disk arrays based on very small diameter magnetic disks.
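The sketch below is a minimal illustration of the block-interleaved parity scheme underlying the RAID level 5 organizations discussed throughout this report: each stripe carries one parity block, the bytewise XOR of its data blocks, and parity placement rotates across the disks. The array width, block size, and particular rotation are assumptions chosen for illustration, not the project's code.

    /*
     * Minimal sketch (illustrative only): block-interleaved parity as in a
     * RAID level 5 array.  Each stripe of NDISKS-1 data blocks carries one
     * parity block, the bytewise XOR of the data blocks; parity placement
     * rotates across disks so no single disk becomes a parity hot spot.
     * Array width and block size below are assumed values.
     */
    #include <stdio.h>
    #include <string.h>

    #define NDISKS      5           /* assumed array width  */
    #define BLOCK_BYTES 512         /* assumed block size   */

    /* XOR the data blocks of one stripe into a parity block. */
    static void compute_parity(unsigned char data[][BLOCK_BYTES],
                               int ndata, unsigned char *parity)
    {
        memset(parity, 0, BLOCK_BYTES);
        for (int d = 0; d < ndata; d++)
            for (int b = 0; b < BLOCK_BYTES; b++)
                parity[b] ^= data[d][b];
    }

    /* One possible rotation: stripe s keeps its parity on disk
     * (NDISKS - 1 - s % NDISKS); the remaining disks hold data. */
    static int parity_disk(int stripe)
    {
        return NDISKS - 1 - (stripe % NDISKS);
    }

    int main(void)
    {
        unsigned char data[NDISKS - 1][BLOCK_BYTES];
        unsigned char parity[BLOCK_BYTES];

        memset(data, 0xA5, sizeof data);        /* dummy contents */
        compute_parity(data, NDISKS - 1, parity);

        for (int s = 0; s < NDISKS; s++)
            printf("stripe %d: parity on disk %d\n", s, parity_disk(s));
        return 0;
    }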
Accomplishments

Berkeley RAID

Before April 1991, Gibson was wholly occupied at Berkeley with dissertation research and with contributing to the design of Berkeley's prototype network-attached massive disk array, RAID-II. After arriving at Carnegie Mellon and beginning work with the DSSC, Gibson led a year-long (July 1991 through June 1992) Input/Output Systems Seminar based on his dissertation research. This seminar series energized computer science student interest in storage systems and directly led to the parity logging idea described below.

Since coming to CMU, Gibson has continued to cooperate with Berkeley, publishing his dissertation with MIT Press [Gibson91], publishing a paper on designing disk arrays for high reliability [Gibson93], contributing to a submission on the architecture of high performance disk arrays [Lee93], and contributing to a survey article in preparation.

Highly Available Disk Arrays

Some applications, notably on-line transaction processing (OLTP), require storage systems that offer high throughput for many parallel small random accesses, low-cost storage, high data reliability, and high data availability. Standard redundant disk array organizations do not provide sufficiently high availability because servicing requests in an array with a failed disk doubles the load applied to the surviving disks, and servicing requests during a subsequent disk recovery more than doubles that load. In this case, either user access to data is greatly reduced for minutes (at least) or recovery must be done so slowly that data reliability is greatly reduced.

In this research, we have focused on balancing cost against data reliability and performance during failure recovery in highly available, parity-based arrays for use in continuous-operation systems. We have employed a parity encoding scheme called "parity declustering" that improves on standard parity organizations by reducing the additional load on surviving disks during the reconstruction of a failed disk's contents. This yields higher user throughput during recovery, shorter recovery time, or both. Working in a detailed simulation environment [Lee91], we have developed a flexible, efficient implementation of declustered parity based on balanced incomplete and complete block designs [Holland92]; a simplified layout sketch appears at the end of this subsection. We then evaluated this implementation under a highly concurrent workload composed of small user accesses. We show that declustered parity penalizes user response time while a disk is being repaired (before and during its recovery) less than comparable non-declustered (RAID level 5) organizations, without any penalty to user response time in the fault-free state. In particular, we show how to trade increased redundancy (cost) for decreased user response time penalties during recovery. Further, we have examined previously proposed optimization algorithms [Muntz90] for declustered parity disk arrays and found that, contrary to previous suggestions, these algorithms may actually penalize performance during recovery in many arrays.

The results of this declustered parity implementation and evaluation effort led us to pursue recovery algorithm mechanisms for further controlling the tradeoff between on-line performance during recovery and data reliability. We have evaluated the differences between recovery based on one or more threads reconstructing one "stripe" of disk array data at a time and recovery based on one or more threads collecting data from, or delivering reconstructed data to, individual disks. The latter "disk-oriented" algorithm requires a shared buffer space and distributed reconstruction, but it yields substantially better utilization of array disks during reconstruction [Holland93].
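The sketch below illustrates the layout idea behind declustered parity for the simplest (complete block design) case; it is our own illustration, not the [Holland92] implementation. With an assumed array of 7 disks and a parity stripe width of 4, every 4-disk subset of the array holds one parity stripe. Because each pair of disks appears together in the same number of stripes, the reads needed to reconstruct a failed disk are spread evenly over the survivors, and each survivor absorbs only a fraction of the full-stripe (RAID level 5) reconstruction load.

    /*
     * Illustrative sketch only: declustered parity laid out with a complete
     * block design.  With C disks and a parity stripe width of G < C, every
     * G-disk subset of the array holds one parity stripe.  The program
     * counts, for an assumed failed disk, how many stripe units each
     * surviving disk must read during reconstruction; the counts come out
     * equal, showing the evenly spread recovery load.  C and G are assumed.
     */
    #include <stdio.h>

    #define C 7                     /* disks in the array (assumed)      */
    #define G 4                     /* disks per parity stripe (assumed) */

    static int reads[C];            /* reconstruction reads per survivor */

    /* Recursively enumerate every G-disk subset; each subset is one parity
     * stripe.  When a stripe contains the failed disk, each of its other
     * G-1 members must be read to rebuild the lost unit. */
    static void enumerate(int next, int chosen[], int depth, int failed)
    {
        if (depth == G) {
            int hit = 0;
            for (int i = 0; i < G; i++)
                if (chosen[i] == failed)
                    hit = 1;
            if (hit)
                for (int i = 0; i < G; i++)
                    if (chosen[i] != failed)
                        reads[chosen[i]]++;
            return;
        }
        for (int d = next; d < C; d++) {
            chosen[depth] = d;
            enumerate(d + 1, chosen, depth + 1, failed);
        }
    }

    int main(void)
    {
        int chosen[G];
        int failed = 0;             /* say disk 0 fails */

        enumerate(0, chosen, 0, failed);
        for (int d = 1; d < C; d++)
            printf("disk %d reads %d stripe units during reconstruction\n",
                   d, reads[d]);
        return 0;
    }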
Parity Logging: Overcoming the Small Write Bottleneck

This research addresses another weakness of standard redundant disk arrays under workloads, such as OLTP, that feature a large number of small random writes. The problem is that each time data is changed, the corresponding parity encoding must also be updated. If the update is large enough and appropriately aligned so that all data associated with an affected parity block is changed, then the new parity can be computed in memory; in this case the parity update overhead is just the additional write of the new parity. If the update is small, however, a data update quite often requires that the old data be read from disk before it is overwritten, the changed bits be computed, the old parity be read, the corresponding parity bits be changed, and then the new parity be written. A write of a single data block thus causes four disk blocks to be accessed: two read and two written. This cost has been called the "small write problem" or the "small write bottleneck."

The most powerful solution to this problem is the "log-structured filesystem" [Rosenblum91]. It works by dynamically remapping the disk storage location of each user's data so that all recently written data is located together. Then, by delaying the actual write accesses, enough data can be collected that all writes are large enough to be efficient. However, this approach requires extensive changes in the operating system, and it can lead to very poor performance if the read access patterns differ from the write access patterns. In particular, if data is written in small random blocks but read in large sequential blocks (as in report generation), then any smart disk allocation provided by the user (or data manager) is ineffective and low performance may result.

We have developed a novel solution that preserves the user's disk allocation decisions while reducing the cost of a data write to slightly more than two disk accesses, or even as low as slightly more than one disk access [Stodolsky93]. Our solution is based on delaying only the application of changes to parity blocks: these changes are logged in large blocks to "log space," then applied much later, when enough work has accumulated for efficient group updates (a brief sketch of this write path follows this subsection). We have developed an analytic model for comparing our approach, called "parity logging," to the standard RAID level 5, traditional mirroring, the proposed floating parity scheme [Menon92], and non-redundant approaches; we plan to extend this work with a direct comparison to the log-structured filesystem approach. For a strawman disk array, in the best case, our scheme achieves 36 small random writes per second per disk, while RAID level 5 delivers only 18, floating parity delivers 23, and mirroring delivers 25 (for comparison, these disks can do up to 50 I/Os per second if no redundancy is maintained). We have also implemented these redundancy schemes in the simulation platform used by the highly available disk arrays research and found that our analytic model's error is less than 5%. An additional benefit of the parity logging approach is that when higher levels of data protection are employed (tolerating all double or triple disk failures), its performance overhead grows much more slowly than that of the other schemes. This work has the potential to be very important to the disk array marketplace.
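The following sketch illustrates the parity logging write path described above. It is our own illustration, not the [Stodolsky93] implementation: the block and log region sizes are arbitrary, and the disk accesses are stubs. A small write still reads the old data and writes the new data, but instead of the read-modify-write of the parity block, the parity update record (the XOR of old and new data) is appended to an in-memory region that is flushed to log space in large, efficient transfers and applied to parity much later in batched passes.

    /*
     * Minimal sketch of the parity logging write path (illustrative only).
     * Block size, region size, and the disk I/O stubs are assumptions.
     */
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_BYTES   512
    #define REGION_BLOCKS 64        /* log region flushed as one large write */

    struct update {                 /* one parity update record              */
        unsigned long parity_block;             /* parity block address      */
        unsigned char delta[BLOCK_BYTES];       /* old data XOR new data     */
    };

    static struct update region[REGION_BLOCKS];
    static int region_fill;

    /* Stubs standing in for real disk accesses in this sketch. */
    static void read_block(unsigned long blk, unsigned char *buf)
    {
        (void)blk; memset(buf, 0, BLOCK_BYTES);
    }
    static void write_block(unsigned long blk, const unsigned char *buf)
    {
        (void)blk; (void)buf;
    }
    static void append_to_log_space(const struct update *r, int n)
    {
        (void)r;
        printf("flushing %d parity update records in one large write\n", n);
    }

    /* Small write under parity logging: one read (old data), one write
     * (new data), and a deferred, amortized parity update via the log. */
    static void small_write(unsigned long data_blk, unsigned long parity_blk,
                            const unsigned char *new_data)
    {
        unsigned char old_data[BLOCK_BYTES];
        struct update *rec = &region[region_fill];

        read_block(data_blk, old_data);                 /* 1st access      */
        write_block(data_blk, new_data);                /* 2nd access      */

        rec->parity_block = parity_blk;
        for (int b = 0; b < BLOCK_BYTES; b++)
            rec->delta[b] = old_data[b] ^ new_data[b];  /* log, don't write */

        if (++region_fill == REGION_BLOCKS) {           /* amortized cost  */
            append_to_log_space(region, region_fill);
            region_fill = 0;
        }
    }

    int main(void)
    {
        unsigned char buf[BLOCK_BYTES];
        memset(buf, 0x5A, sizeof buf);
        for (int i = 0; i < 200; i++)
            small_write((unsigned long)i, 1000 + i % 16, buf);
        return 0;
    }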
Nectar-attached Disk Array

This effort is derived from one of the longest standing projects in this thrust area. We have always recognized the need for an efficient, effective interface between high performance storage systems and the high performance networks that deliver data between storage servers and client machines. During 1990 and 1991 the FDDI-based, 100 Mbit/sec prototype Nectar network [Arnould89, Kung91] was used to explore the bottlenecks in a standard, workstation-based UNIX file server delivering data to a remote machine. We found that a large block size was needed to drive a single disk at greater than 50% utilization; with 32 KB blocks we achieved 95% utilization for large image accesses. When we applied similar tests over the network, prototype Nectar did not slow data transfers, while Ethernet limited disk utilization to under 30%. We also identified data copying in the server as a major source of performance loss.

The next phase of our research is to attach disk arrays to nodes in the faster, HIPPI-based Gigabit Nectar implementation. Gigabit Nectar is expected to be ready for workstation use by the summer of 1993. Because we intend to use commercial disk arrays for this platform, and because such products are improving rapidly, we are delaying acquisition and will put our disk array in place at about the same time as Gigabit Nectar becomes available.

Architectures and Interfaces for Very Small Disks

This effort focuses on the challenges of exploiting magnetic disks of rapidly decreasing diameter as components in massively parallel disk arrays for ever faster supercomputers. This portion of our disk array studies has not yet come up to speed; we have yet to identify a student for this work. Consultations between our disk array systems specialists and magnetic disk technologists are underway. We expect to make progress on this project in 1993 and hope to have a student in place by the end of the year.

Plans

We are proceeding in this project in many directions at once. Our cooperation with Berkeley continues, with emphasis on understanding the design and performance of their RAID-II disk array controller prototype. Our effort in highly available disk arrays is focusing on recovery algorithms that shift their reconstruction focus to track intervening user requests. We will also be looking at the implications of parity logging for recovery performance and data reliability. The parity logging effort will begin by broadening the range of workloads to which it can be profitably applied and by extending its modeling and simulation to understand its advantages and disadvantages relative to log-structured filesystems. As Gigabit Nectar comes on-line, we will put a great deal of effort into Nectar-attached storage systems. We will use commercial products as much as possible, except for software control, where we hope to implement our recovery algorithms, our parity logging redundancy control scheme, and a prefetching file system design. We will also return to an examination of file system and operating system overhead for much faster storage servers on much faster networks. Finally, much work remains to be done in our small disks for massive disk arrays effort; primarily, we need to find a student for this project who will examine the interfaces and architectures for these small diameter disks.