Parallel Data Laboratory

Cost-efficient computing in the cloud

Cloud service providers (CSPs) offer an effectively infinite (from most customers’ viewpoints) set of VM instances available for rental at fine time granularity. Figuring out which instances to request is difficult, however, because each CSP offers diverse VM instance "types", primarily differentiated by their constituent hardware resources (e.g., core counts and memory sizes) and leasing contract models. The two primary contract models are reliable and transient (sometimes called "spot" or "preemptible"). Instances leased under a reliable contract are non-preemptible, while instances leased under a transient contract are made available on a best-effort basis at decreased cost and/or at lower priority.

This project explores opportunities to exploit the various CSP VM instance offerings (e.g., best-fit VM sizes or low-cost but unreliable VMs rented on transient contracts) to reduce the cost of running user applications in the cloud, while respecting application performance requirements. Explored applications include task scheduling and resource management for general batch analytics jobs (Stratus), elastic web services (Tributary), and machine learning model training (Proteus).

As an example, AWS's transient VM leasing model is based on a price market that works as follows. The user specifies a *bid price*, the maximum amount they are willing to pay to rent a VM for an hour, but is charged the market price rather than the bid. If the market price rises above the bid price during the VM's rental period, the rented VM is preempted; in some pricing models, the user may not be charged if the VM is preempted within the first hour of rental. This provides opportunities for strategic bidding based on user application reliability requirements.
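The bidding mechanics described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the project's systems: the market price is sampled once per hour, the user pays the market price for each completed hour, and the hour in which a preemption occurs is not charged. The names `RentalOutcome` and `simulate_rental` and the price trace are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RentalOutcome:
    hours_run: int    # full hours completed before preemption or end of trace
    preempted: bool   # True if the market price rose above the bid
    cost: float       # total charged, at market price (not bid price)


def simulate_rental(bid: float, hourly_prices: Sequence[float]) -> RentalOutcome:
    """Replay one transient-VM rental against an hourly market price trace.

    Assumption: hourly_prices[h] is the market price during hour h.
    Assumption: the hour in which preemption happens is free.
    """
    cost = 0.0
    for hour, price in enumerate(hourly_prices):
        if price > bid:
            # Market price exceeded the bid: the VM is preempted during
            # this hour, and only the completed hours have been charged.
            return RentalOutcome(hours_run=hour, preempted=True, cost=cost)
        # The VM survives this hour; the user pays the market price.
        cost += price
    return RentalOutcome(hours_run=len(hourly_prices), preempted=False, cost=cost)


if __name__ == "__main__":
    trace = [0.04, 0.05, 0.12, 0.03]  # hypothetical prices in $/hour
    for bid in (0.01, 0.10, 0.20):
        print(bid, simulate_rental(bid, trace))
```

In this toy trace, a higher bid buys a longer uninterrupted run without changing the per-hour charge, which is the trade-off that strategic bidding exploits: applications that tolerate preemption can bid low, while those that need reliability bid higher.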

People

FACULTY

Greg Ganger
Phil Gibbons

STUDENTS

Andrew Chung
Aaron Harlap (alumni)
Jun Woo Park (alumni)
Alexey Tumanov (alumni)

Publications

Realizing Value in Shared Compute Infrastructures. Andrew Chung. Carnegie Mellon University PhD Dissertation CMU-CS-22-151, December 2022.
Abstract / PDF [3M]
Stratus: Cost-aware Container Scheduling in the Public Cloud. Andrew Chung, Jun Woo Park, Gregory R. Ganger. ACM Symposium on Cloud Computing, 2018 (SoCC’18), Carlsbad, CA, October 11-13, 2018.
Abstract / PDF [1.5M]
Tributary: Spot-dancing for Elastic Services with Latency SLOs. Aaron Harlap, Andrew Chung, Alexey Tumanov, Gregory R. Ganger, Phillip B. Gibbons. 2018 USENIX Annual Technical Conference. July 11–13, 2018, Boston, MA, USA. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-18-102.
Abstract / PDF [1.25M]
Proteus: Agile ML Elasticity through Tiered Reliability in Dynamic Resource Markets. Aaron Harlap, Alexey Tumanov, Andrew Chung, Greg Ganger, Phil Gibbons. ACM European Conference on Computer Systems, 2017 (EuroSys'17), 23rd-26th April, 2017, Belgrade, Serbia. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-16-102. May 2016.
Abstract / PDF [743K]

Acknowledgements

We thank the members and companies of the PDL Consortium: Amazon, Bloomberg LP, Datadog, Google, Honda, Intel Corporation, Jane Street, LayerZero Research, Meta, Microsoft Research, Oracle Corporation, Oracle Cloud Infrastructure, Pure Storage, Salesforce, Samsung Semiconductor Inc., and Western Digital for their interest, insights, feedback, and support.