Hybrid cloud deployments, which combine on-premises data centers with public cloud resources, have become common as large organizations seek benefits such as load bursting, more agile growth paths, and access to new technologies with smaller up-front investments. The division of resources differs between organizations and can change over time. Any deployment of large-scale data analytics across sites requires careful partitioning of both data and computation to avoid massive networking costs.
Simplistic data partitioning without replication leads, for data-heavy workloads, to increased dollar costs from cloud egress and on-premises networking requirements. Alternatively, extensive replication of data at each site reduces communication costs at the expense of significant storage capacity costs.
This project identifies opportunities for improvement by collecting and analyzing large-scale hybrid cloud analytics workloads. As a first step, we built Moirai,* an optimization framework that improves cost efficiency in hybrid cloud deployments. Moirai formulates a mixed-integer programming (MIP) problem with the goal of minimizing dollar costs, including cloud egress, replication storage, and network link provisioning, subject to operational constraints such as on-premises capacity and inter-site bandwidth limits. Unlike prior approaches, Moirai informs its optimization with online analysis of job logs and the per-job data accesses they record. To scale to large infrastructures, it applies a novel combination of optimization refinements that shrink the problem space using characteristics observed in the logs: grouping jobs by query template similarity, pruning tables not accessed in the last week, and pre-selecting the most commonly needed tables for replication. For job placement, Moirai optimizes recurring jobs directly and uses a per-table access-size predictor to minimize remote fetches for new jobs. Finally, Moirai adapts to dynamic workloads by periodically reevaluating placement decisions, accounting for shifting data, job patterns, and resource availability.
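Two of the log-driven refinements above can be illustrated with a minimal sketch. The code below is not Moirai's implementation: the literal-stripping heuristic for query templates and the greedy budget-constrained table pre-selection are illustrative stand-ins (the real system solves egress, storage, and link costs jointly as an MIP), and all names here are hypothetical.

```python
import re
from collections import Counter

def query_template(sql: str) -> str:
    """Map a query to a template by stripping literals, so recurring
    jobs that differ only in parameters group together (heuristic)."""
    sql = re.sub(r"'[^']*'", "?", sql)   # replace string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # replace numeric literals
    return " ".join(sql.lower().split())

def pick_replicas(job_log, table_sizes, storage_budget_bytes):
    """Greedily pre-select tables for replication, ranked by the
    remote-fetch bytes replication would avoid (accesses * size),
    until a storage budget is exhausted. A sketch only: the MIP
    additionally weighs egress prices and link provisioning."""
    saved_bytes = Counter()
    for job in job_log:                  # each job lists tables it reads
        for table in job["tables"]:
            saved_bytes[table] += table_sizes[table]
    chosen, used = [], 0
    for table, _ in saved_bytes.most_common():
        if used + table_sizes[table] <= storage_budget_bytes:
            chosen.append(table)
            used += table_sizes[table]
    return chosen
```

For example, two queries such as `SELECT * FROM trips WHERE city_id = 5` and `SELECT * FROM trips WHERE city_id = 42` normalize to the same template, letting the optimizer place one job group rather than millions of individual queries.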
We evaluated Moirai on a workload of 66.7M Presto queries and Spark jobs accessing 13.3EB of data from a 300PB corpus, logged at Uber over four months. This workload has high interconnectivity, moderate job template recurrence, and continuous growth, creating a challenging optimization problem. By leveraging job-data dependencies, selective replication, and predictive job routing, Moirai reduced cost by over 97% compared to prior approaches (Figure 2), demonstrating Moirai’s ability to scale, adapt to change, and support efficient cloud migration.
--
* Moirai is named for the three Fates of Greek mythology—Spinner, Allotter, and Enactor—and Moirai reuses these names for its components.

Fig 1. Moirai architecture: Moirai dynamically and continuously optimizes job (circles)
and data (squares) placement in hybrid clouds, using a feedback loop.

Fig 2. Placement approach costs: Moirai reduces cost by 97% compared to Alibaba’s Yugong, the state of the art, when run on 4-month traces of Uber’s Presto and Spark services. “No Rep (Spotify)” randomly partitions data with no replication, which is simple but costly for modern data-heavy workloads due to cloud egress and on-premises networking. “Rep 3Mon (Twitter)” replicates the most recent three months across sites, reducing egress but substantially increasing storage cost.
FACULTY
George Amvrosiadis
Greg Ganger
GRAD STUDENTS
COLLABORATORS
Jing Zhao, Uber
We thank the members and companies of the PDL Consortium: Amazon, Bloomberg LP, Datadog, Google, Intel Corporation, Jane Street, LayerZero Research, Meta, Microsoft Research, Oracle Corporation, Oracle Cloud Infrastructure, Pure Storage, Salesforce, Samsung Semiconductor Inc., and Western Digital for their interest, insights, feedback, and support.