PDL Abstract

EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding

12th USENIX Symposium on Operating Systems Design and Implementation, Nov. 2–4, 2016,.

K. V. Rashmi, Mosharaf Chowdhury^, Jack Kosaian^, Ion Stoica*, Kannan Ramchandran*

Carnegie Mellon University and UC Berkeley
* University of California, Berkeley
^ University of Michigan

Data-intensive clusters and object stores are increasingly relying on in-memory object caching to meet the I/O performance demands. These systems routinely face the challenges of popularity skew, background load imbalance, and server failures, which result in severe load imbalance across servers and degraded I/O performance. Selective replication is a commonly used technique to tackle these challenges, where the number of cached replicas of an object is proportional to its popularity. In this paper, we explore an alternative approach using erasure coding.

EC-Cache is a load-balanced, low latency cluster cache that uses online erasure coding to overcome the limitations of selective replication. EC-Cache employs erasure coding by: (i) splitting and erasure coding individual objects during writes, and (ii) late binding, wherein obtaining any k out of (k + r) splits of an object are sufficient, during reads. As compared to selective replication, EC-Cache improves load balancing by more than 3? and reduces the median and tail read latencies by more than 2?, while using the same amount of memory. EC-Cache does so using 10% additional bandwidth and a small increase in the amount of stored metadata. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance and server failures.