The Parallel Data Lab (PDL) at Carnegie Mellon University is academia's premier data systems research center. An interdisciplinary group, its 50+ researchers come mainly from the Computer Science and ECE Departments. PDL also has many friends in industry who generously provide us with advice, along with some of the funding and equipment necessary to carry out our research.

PDL has an almost 30-year track record of research, education, and collaboration with industry. A brief description of PDL's genesis and early history can be found here.


Brief Research Overview

PDL research addresses a broad spectrum of data infrastructure challenges and opportunities, including scalability, efficiency, data reliability, emerging technologies, heterogeneity in systems, cloud computing, data lakes, and the intersection of ML and systems. Ongoing projects change over time, as some wrap up and others launch, often inspired by or in collaboration with PDL sponsor companies.


  • Big Learning (Systems for ML) - scaling machine learning to large, sophisticated models and huge datasets for average machine learning practitioners by developing programmable, distributed computing frameworks.
  • Caching to Improve Latency & Efficiency at Scale (CILES) - new designs that make both flash caching systems and content delivery caches more efficient, in terms of wear, tail latency, resiliency, cost-effectiveness, and of course hit rates.
  • Cost-efficient Computing in the Cloud - explores opportunities to exploit the various cloud service provider (CSP) VM instance offerings (e.g., best-fit VM sizes or low-cost but unreliable VMs rented on transient contracts) to reduce the cost of running user applications in the cloud.
  • Database Systems - experimental database systems exploring different aspects of storage and efficiency in large-scale databases.
  • Data Center Observatory (DCO) - a working data center and a research vehicle for the study of data center automation and efficiency.
  • Data Lake Scheduling - unearthing, analyzing, and exploiting hidden inter-job dependencies in data lakes (data analytics infrastructures) to better schedule jobs and manage resources.
  • DBMS Autotuning - autotuning Database Management Systems to improve performance and resolve problems.
  • DeltaFS - a new distributed file system service created to efficiently handle small-chunk reads and writes at exascale.
  • HeART - adaptive redundancy tuning to observed device failure rates in large distributed storage systems.
  • Mimir: Navigating cloud storage - helping users to make optimal decisions when composing distributed storage systems in the public cloud.
  • ML Coded Computation - more resilient computation via coding theory; performs computation F over all original and parity inputs in parallel, and decodes unavailable results of computation using the available results of computation from original and parity inputs.
  • NVM Redundancy - hardware and software approaches to providing memory-speed, storage-quality checksums and cross-chip parity.
  • Peloton - a relational database management system designed for fully autonomous optimization of hybrid workloads.
  • Zoned Storage - exploring the system and software impacts of this new interface, which redefines the division of responsibilities between storage software and device firmware.
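
The ML Coded Computation entry above describes running a computation F over both original and parity inputs, then decoding a missing result from the ones that did arrive. The sketch below illustrates that idea in its simplest setting. It assumes, purely for illustration, that F is linear and that a single sum parity is used; the function names (`linear_f`, `encode_parity`, `decode_missing`) and the weight matrix are hypothetical and are not taken from the project itself.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def linear_f(x: Sequence[float]) -> Vector:
    """Hypothetical linear computation F(x) = W x (W chosen arbitrarily)."""
    weights = [[1.0, 2.0], [3.0, 4.0]]
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def encode_parity(inputs: Sequence[Sequence[float]]) -> Vector:
    """Build one parity input as the element-wise sum of all original inputs."""
    return [sum(column) for column in zip(*inputs)]


def decode_missing(outputs: List[Optional[Vector]],
                   parity_output: Vector) -> List[Vector]:
    """Recover at most one unavailable output (marked None).

    Because F is assumed linear, F(parity) equals the sum of all F(x_i),
    so the missing output is F(parity) minus every available output.
    """
    missing = [i for i, out in enumerate(outputs) if out is None]
    if len(missing) > 1:
        raise ValueError("a single sum parity can recover only one missing output")
    if not missing:
        return [list(out) for out in outputs if out is not None]

    recovered = list(parity_output)
    for out in outputs:
        if out is not None:
            recovered = [r - v for r, v in zip(recovered, out)]

    result = [list(out) if out is not None else recovered for out in outputs]
    return result


if __name__ == "__main__":
    inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    parity = encode_parity(inputs)

    # In a real deployment these calls would run in parallel on separate workers.
    outputs: List[Optional[Vector]] = [linear_f(x) for x in inputs]
    parity_output = linear_f(parity)

    outputs[1] = None  # simulate one slow or failed worker
    print(decode_missing(outputs, parity_output)[1])
```

The subtraction-based decoder only works because of the linearity assumption; computations that are not linear would need a different encoding and decoding scheme.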


PDL Director
(412) 268-1297
RMCIC 2208

PDL Executive Director
(412) 268-5485
RMCIC 2210

PDL Administrative Manager
(412) 268-6716
RMCIC 2209

Mailing Address:

Parallel Data Lab
Carnegie Mellon University
4720 Forbes Avenue - RMCIC 2209
Pittsburgh, PA 15213-3891

Find Us

PDL's Visitor Information page.

The School of Computer Science's list of directions on how to find your way to and around CMU.