PDL Abstract

Tributary: Spot-dancing for elastic services with latency SLOs

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-18-102, Jan 2018.

Aaron Harlap†*, Andrew Chung†*, Alexey Tumanov§, Gregory R. Ganger†, Phillip B. Gibbons†

† Carnegie Mellon University
§ UC Berkeley

* Denotes equal contribution

The Tributary elastic control system embraces the uncertain nature of transient cloud resources, such as AWS spot instances, to manage elastic services with latency SLOs more robustly and more cost-effectively. Such resources are available at lower cost, but with the proviso that they can be preempted en masse, making them risky to rely upon for business-critical services. Tributary creates models of preemption likelihood and exploits the partial independence among different resource offerings, selecting collections of resource allocations that will satisfy SLO requirements and adjusting them over time as client workloads change. Although Tributary’s collections are often larger than required in the absence of preemptions, they are cheaper because of both lower spot costs and partial refunds for preempted resources. At the same time, the often-larger sets allow unexpected workload bursts to be handled without SLO violation. Over a range of web service workloads, we find that Tributary reduces cost for achieving a given SLO by 81–86% compared to traditional scaling on non-preemptible resources and by 47–62% compared to the high-risk approach of the same scaling with spot resources.
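The idea of exploiting independence among resource offerings can be illustrated with a small sketch. It is not Tributary's actual model or selection algorithm; it assumes that each offering is preempted en masse with a known probability, that offerings are fully independent (the abstract claims only partial independence), and that a preempted allocation is refunded a fixed fraction of its cost. The names `prob_meets_demand`, `expected_cost`, and `refund_fraction` are hypothetical.

```python
from itertools import product

# Each allocation is a tuple: (instances, hourly_price_per_instance, preempt_prob).
# Assumption: an offering is preempted all at once, independently of the others.

def prob_meets_demand(allocations, required_instances):
    """Probability that the surviving instances cover the required count."""
    total = 0.0
    # Enumerate every preempted / not-preempted outcome across offerings.
    for outcome in product([False, True], repeat=len(allocations)):
        prob = 1.0
        surviving = 0
        for (instances, _price, q), preempted in zip(allocations, outcome):
            prob *= q if preempted else (1.0 - q)
            if not preempted:
                surviving += instances
        if surviving >= required_instances:
            total += prob
    return total

def expected_cost(allocations, refund_fraction=1.0):
    """Expected hourly cost, assuming a preempted allocation is refunded
    refund_fraction of its price (a simplifying assumption)."""
    return sum(
        instances * price * (1.0 - q * refund_fraction)
        for instances, price, q in allocations
    )

# Two offerings of 4 instances each, with made-up prices and probabilities.
collection = [(4, 0.03, 0.1), (4, 0.02, 0.2)]
print(prob_meets_demand(collection, required_instances=4))
print(expected_cost(collection))
```

Under these assumptions, spreading eight instances over two offerings covers a demand of four unless both offerings are preempted together, which is why a larger, diversified collection can be both more robust and cheap in expectation.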

KEYWORDS: Big Data infrastructure, Big Learning systems
