PARALLEL DATA LAB

EARLY HISTORY OF THE PDL

Since its inception, Carnegie Mellon's Parallel Data Lab (PDL) has pushed the state of the art with new data system architectures, technologies, and design methodologies. Since its early years, PDL has grown considerably in the breadth of its research activities, in the number of CMU people involved, and in the number of industry and government sponsors and collaborators. This brief history focuses on PDL's early years, with a few notes about how things changed over time.

Dr. Garth Gibson founded the PDL in 1992 with 7 students from CMU's CS and ECE Departments. Having recently finished his Ph.D. research, which defined the industry standard RAID terminology for redundant disk arrays, Gibson guided PDL researchers in advanced disk array research. The name "Parallel Data Lab" came from this initial focus on parallelism in storage systems. In the PDL's formative years, its researchers developed technologies for improving failure recovery performance (parity declustering) and maximizing performance in small-write intensive workloads (parity logging). They also developed an aggressive prefetching technology (transparent informed prefetching, or TIP) for converting serial access patterns into highly parallel workloads capable of exploiting large disk arrays.

The first PDL Workshop and Retreat was held in October of 1993. Its goals were to interact with our industry sponsors, to give them a chance to get to know the PDL students and researchers and hear about their work, and to give the students a chance to get new feedback and ideas and to develop relationships with prospective employers. At the first retreat, 20 CMU participants described PDL research on disk arrays, parity logging and declustering, and other related CMU research to 11 industry attendees from 6 sponsor companies. (The 2019 PDL Retreat had 43 industry attendees from 20 sponsor companies.) As is still the case, the first Retreat was highly interactive, allowing the sponsors to hear about and give feedback on PDL research, and offering the students a chance to develop relationships with future colleagues and potential employers.

One of the PDL students at the time recalls everyone wondering if they would have enough solid content to keep the industry attendees' attention throughout the 3-day retreat. Of course, it was not a problem. Every year since then, the difficult problem has been what to leave out, as the PDL researchers generate more cool ideas than will fit into the available time. The first PDL Retreat was held at the Hidden Valley Resort in Pennsylvania; for many years after that, we gathered at the Nemacolin Woodlands Resort in Farmington, Pennsylvania; for the last several years, we have held the PDL Retreat at the Omni Bedford Springs Resort in Bedford, Pennsylvania.

Despite its name, the PDL did not start using actual lab space at CMU (Wean Hall 3607) until January of 1994. Since then, PDL has grown and its people have spread out, moving into and then out of Wean Hall 3606, Wean Hall 3701 (The Systems Architecture Lab), and various locales on D-Level of Hamerschlag Hall. Today, most of the PDL staff are located in the Robert Mehrabian Collaborative Innovation Center (RMCIC), as is the Data Center Observatory discussed below. Over the years, PDL researchers have been fortunate to work with substantial distributed systems and state-of-the-art equipment, thanks primarily to generous donations from and collaborations with sponsor companies.

From the beginning, the PDL logo has included Skibo Castle, Andrew Carnegie's summer home near Dornoch, Scotland. In the past, it has represented "a fortress of storage" (like a redundant disk array). More recently, it was also used during one broad project to represent a "fortress of security" (provided by self-securing devices). Perhaps, though, it is simply our vision of the ideal PDL Retreat venue.

PDL's initial seed funding came from CMU's Data Storage Systems Center (DSSC), then directed by Mark Kryder, and from DARPA (from which much PDL funding has come over the years). Additional funding came from the member companies of the PDL Consortium (the set of PDL sponsor companies), whose initial members were AT&T Global Information Systems, Data General, IBM, Hewlett-Packard, Seagate and Storage Technology. The list of government sponsors and PDL Consortium members varies over time, as new projects replace older ones and as companies merge or emerge.

In 1995, Gibson and Dr. David Nagle launched a new PDL project called Network-Attached Secure Disks (NASD). NASD was a new network-attached storage architecture for achieving cost-effective scalable bandwidth. In addition to the project's fundamental research advances, Gibson founded and chaired an industry working group within the Information Storage Industry Consortium (INSIC) to transfer the new technology and move towards standardization of the NASD architecture. In 1999, working group members produced a concrete proposal to launch an ANSI standards effort around object-based storage devices, essentially using the NASD architecture. Since then, the NASD project has stimulated much derivative research and development in academia and industry.

In 1999, we held our first annual Spring Industry Visit Day, a one-day event that grew out of requests from industry for more frequent interaction with PDL researchers. Discussion during the open house revolves primarily around posters and demonstrations of the software and hardware prototypes developed during research, with less emphasis on formal spoken presentations.

In 1999, Nagle took over as PDL Director when Gibson went on leave. In 2000, Dr. Greg Ganger, who had joined the ECE faculty and the PDL in 1997, began directing the PDL jointly with Nagle; he became PDL Director in 2001, when Nagle left CMU to become an industry leader. Ganger has been PDL Director since.

In 2006, after several years of preparation, the PDL established the Data Center Observatory (DCO), with over 2000 square feet of machine room space able to accommodate up to 40 compute/storage/networking racks. In 2020, it is populated with over 1000 server machines. In addition, a large NetApp filer provides robust core storage (1 PB), and the DCO includes high-end GPU systems and extensive monitoring and instrumentation. As a data center space, it provides a computation and storage utility to resource-hungry research activities such as data analytics and ML, design simulation, ML systems, and distributed systems experimentation. As an observatory, it has provided invaluable real data to PDL researchers seeking to understand the sources of operational costs and to evaluate novel solutions.

Over the PDL's lifetime, many faculty, staff, and students from both CS and ECE have been active members. PDL has always been a very collaborative environment, both internally and with companies. The group has grown to over 60 current CMU members (including staff, students, and faculty), and research funding and output have seen similar growth as its focus (data storage and processing systems) grew broader and ever more central to modern computing. PDL's first Ph.D. graduate was Dr. Mark Holland (1994), who wrote his dissertation on "On-Line Data Reconstruction in Redundant Disk Arrays." Since then, dozens of PDL students have graduated with Ph.D.s, master's degrees, and undergraduate degrees, and many have moved on to employment with PDL Consortium companies. Some have even founded companies that then became PDL Consortium members.

Over the years, PDL research has explored many aspects of data storage and processing. A list of current projects can be found here, and a list of previous projects can be found here.