Our interactive world of computing, requires fast response times as well as energy and cost efficiency. Caching is a key component to achieving these goals in web-services, analytics, and storage applications. The growth of cache deployments at every layer has exposed gaps between existing assumptions as well as heuristics and practical requirements. The CILES project aims to develop experimental prototypes, which improve the tail latency and efficiency of large deployments, and analytical models, which enable us to efficiently explore the massive design space of caching systems.
Baleen: ML Admission & Prefetching for Flash Caches:
Baleen is a flash cache for bulk storage systems that uses machine learning to train admission and prefetching policies to reduce peak backend load. It does this by introducing a new cache residency model (episodes) to guide model training, and by optimizing for an end-to-end system metric (Disk-head Time) that measures backend load for variable-size workloads more accurately than IO miss rate or byte miss rate. We also introduced OPT, an approximate offline optimal flash admission policy that is based on episodes. To our knowledge, Baleen is the first cache that optimizes peak disk-head time.
PDF / Code / Traces
AdaptSize and LHD are the first caching systems that use online models to adapt the cache eviction with the request traffic, while achieving the high throughput requirements of DRAM caches.
AdaptSize is motivated by the variable workloads at the edge of the Internet, in large content delivery networks. It’s model enables it to efficiently explore the parameter space of CDN DRAM caches. This way, AdaptSize can react to the frequent changes in traffic patterns within seconds and proves robust even under adversarial attacks, which are more common at the edge.
LHD is motivated by complex workloads in storage systems, e.g., due to applications that loop over data. By tracking reuse probability distributions on a per-application basis, LHD is able to accurately predict which objects will take longer to get their next access. By evicting such objects, which waste cache space (i.e., have the least hit density), LHD requires much less cache space to achieve the same hit ratio as heuristical policies such as LRU.
A widely-held opinion is that “caching layers . . . do not directly address tail latency, aside from configurations where it is guaranteed that the entire working set of an application can reside in a cache” [The Tail at Scale]. Yet, tail latency is of great importance in user-facing web services. We analyze a Microsoft production system and find that backend query latencies vary by more than two orders of magnitude across backends and over time, resulting in high request tail latencies.
We overcome the widely-held opinion by building RobinHood, an experimental caching system that effectively reduces tail latency. RobinHood dynamically reallocates cache resources backends which don’t affect tail latency backends which currently cause tail latency. This enables us to hide to reduce load on high-latency backends and hide their high latency. We evaluate RobinHood with production traces on a 50-server cluster with 20 different backend systems. RobinHood meets a 150ms P99 goal 99.7% of the time, whereas the next best policy meets this goal only 70% of the time.
Flash is 10x cheaper and 20-100x more power efficient than DRAM. However, caching on flash comes with its own set of challenges. First and foremost is the limited write endurance of flash cells, which is exacerbated non-sequential writes common in caching. Without limiting the continuous write rate of caches at large companies, flash caches cannot achieve their planned device lifespans by exhausting the endurance of flash cells. This is particularly important for scenarios with high churn rates such as social networks and CDNs, where the contents of the cache is continually changing.
We aim to address this challenge from two angles. First, we need to use admission policies, which prevent objects, which are likely to receive few hits, from ever being written to flash. While this approach has been recognized in prior work, we find that existing admission policies are not effective in practice. In collaboration with engineers at Facebook, we have developed an machine-learning-based admission policy, which effectively maps the per-objects features available in different applications to admission decisions. This way, we can lower write rates by 75% without sacrificing the flash cache’s hit ratio.
Our second research angle is to improve flash cache designs to result in fewer writes and amplification in the first place. This is challenging because cache misses occur for random objects and because these objects have highly variable sizes. While allocating these objects in a more sequential pattern on flash sounds promising, we would then need to track each object’s position on flash in DRAM to enable fast lookups. Due to the frequency of very small objects, this would require excessive amounts of DRAM, effectively negating the advantage of caching on flash. We are prototyping new ideas based on ideas from computer architecture to overcome this limitation.
One of the current targets of CILES is the characterization of the diversity in caching workloads and its impact on cache design. Did you know that there are few public workload traces and that hundreds of papers rely on a simple theoretical model, called the Zipf distribution, to evaluate their caching designs? Due to the scarcity of other data sources, existing caching research has started to overfit to the few public traces and the Zipf distribution. We find that the majority of production traces at major companies are neither similar to existing public traces nor fit the Zipf popularity model. For example, the proposal for tiny in-network caches [NetCache, SOSP’17] assumes that 1MB of cache space can achieve a 40% hit ratio. In contrast, we find that 1MB achieves less than 0.5% hit ratio in practice and that in-network caches have to be several magnitudes larger in practice.
The CILES project aims to develop a workload model that is general enough to capture key characteristics of caching workloads. We hope that this workload model can be widely shared among researchers to ensure that our design can be more readily translated into practice.
September 2, 2021: Facebook's CacheLib project is live.
Akamai: Ramesh Sitaraman
Facebook: Sathya Gunasekar, Jimmy Lu, Hao Wu
Google / Youtube: Pawel Jurczyk
Google: Carlos Villavieja, David Bacon
Microsoft: Daniel S. Berger
Two Sigma: Joshua Leners, Larry Rudolph
We thank the members and companies of the PDL Consortium: Amazon, Google, Hitachi Ltd., Honda, Intel Corporation, IBM, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.