Erasure codes have been widely adopted for imparting resource-efficient resilience to storage and communication systems. Coded-computation is a field of coding theory which aims to use erasure codes to impart resilience against slowdowns and failures that occur in distributed computing systems.
Figure 1 shows an example of using coded-computation to impart resilience over the distributed computation of a function F. As depicted in the figure, coded-computation (1) encodes inputs to the computation to generate “parity inputs,” (2) performs computation F over all original and parity inputs in parallel, and (3) decodes unavailable results of computation using the available results of computation from original and parity inputs.
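The three steps can be illustrated with a minimal sketch. It assumes a hypothetical linear computation F (multiplication by a fixed matrix) and a simple addition code over two inputs; neither choice comes from this project, and the names `F`, `encode`, and `decode` are illustrative only. Because F is linear, F(X1 + X2) = F(X1) + F(X2), so a missing result can be recovered by subtraction.

```python
from typing import Dict, List, Optional

Vector = List[float]

# Hypothetical linear computation: multiply the input by a fixed 2x2 matrix.
A = [[2.0, 1.0],
     [0.0, 3.0]]


def F(x: Vector) -> Vector:
    """Illustrative stand-in for the distributed computation F."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def encode(x1: Vector, x2: Vector) -> Vector:
    """Step 1: build a parity input as the element-wise sum of the inputs."""
    return [a + b for a, b in zip(x1, x2)]


def decode(f_parity: Vector, f_available: Vector) -> Vector:
    """Step 3: recover the missing result by subtracting the available one."""
    return [p - a for p, a in zip(f_parity, f_available)]


x1 = [1.0, 2.0]
x2 = [3.0, 4.0]
parity = encode(x1, x2)

# Step 2: compute F over original and parity inputs (in parallel in a real
# system). Here the worker handling x2 is assumed to be slow or failed.
results: Dict[str, Optional[Vector]] = {
    "x1": F(x1),
    "x2": None,          # unavailable result
    "parity": F(parity),
}

reconstructed = decode(results["parity"], results["x1"])
print(reconstructed)  # matches F(x2) because F is linear
```

In this sketch one extra parity computation protects two original computations, whereas replication would require duplicating each of them.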
Given the ubiquity of distributed execution in modern services, such as web servers, prediction serving systems, and data analytics systems, coded-computation offers exciting potential to enable resource-efficient resilience against slowdowns and failures. However, designing erasure codes for coded-computation is fundamentally more challenging than it is for traditional applications of erasure codes because coded-computation involves computing on encoded data. As a result, current approaches toward coded-computation are only able to support highly restricted classes of computations F. This precludes the use of coded-computation in modern distributed services that would benefit from the resource-efficient resilience of erasure codes.
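A small sketch shows why computing on encoded data is hard once F is no longer linear. It assumes a hypothetical nonlinear F (squaring a scalar) and an addition code with a subtraction decoder; these choices are illustrative and not taken from this project. The subtraction decoder only works when F(X1 + X2) = F(X1) + F(X2), which squaring does not satisfy.

```python
def F(x: float) -> float:
    """Hypothetical nonlinear computation: squaring."""
    return x * x


x1, x2 = 3.0, 4.0
parity = x1 + x2                 # addition code over the two inputs

# Suppose F(x2) is unavailable and we try the decoder that works for
# linear functions: F(parity) - F(x1).
naive_reconstruction = F(parity) - F(x1)
true_result = F(x2)

# For squaring, (x1 + x2)^2 - x1^2 = x2^2 + 2*x1*x2, so the naive decoder
# is off by the cross term 2*x1*x2.
error = naive_reconstruction - true_result
print(naive_reconstruction, true_result, error)
```

The cross term depends on both inputs, so no fixed decoder over this simple code can remove it. This is one concrete instance of the restriction described above.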
In this project, we study the potential for machine learning to alleviate the difficulty of designing new erasure codes for coded-computation. We propose to integrate machine learning into the coded-computation framework and learn to reconstruct slow or failed results of computation.
We have developed multiple techniques for integrating machine learning into the coded-computation framework. As a first driving application, we have shown the promise of learning-based coded-computation for systems that perform inference over neural networks. We have shown that learning-based coded-computation enables accurate reconstruction of unavailable predictions resulting from inference, and significantly reduces tail latency in the presence of resource contention. These benefits come with only a fraction of the resource overhead of replication-based techniques.
While we have showcased learning-based coded-computation for machine learning inference workloads, the core ideas behind our approach have the potential to expand the reach of coded-computation to a broader class of computations. This may enable erasure codes to be applied more broadly in distributed systems.
Shivaram Venkataraman, U. Wisconsin-Madison
The following links contain the source code associated with the research performed in this project.
We thank the members and companies of the PDL Consortium: Amazon, Google, Hitachi Ltd., Honda, Intel Corporation, IBM, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.