Parallel Data Laboratory

PDL Abstract

MATCHMAKER: DATA DRIFT MITIGATION IN MACHINE LEARNING FOR LARGE-SCALE SYSTEMS

Proceedings of the 5th MLSys Conference, Santa Clara, CA, USA, August 2022.

Ankur Mallick¹, Kevin Hsieh², Behnaz Arzani², Gauri Joshi¹

¹Carnegie Mellon University
²Microsoft Research

http://www.pdl.cmu.edu/

Today’s data centers increasingly rely on machine learning (ML) in their deployed systems. However, these systems are vulnerable to the data drift problem, that is, a mismatch between the training data collected in the past and the test data seen in the future, which can lead to significant performance degradation and system inefficiencies. In this paper, we demonstrate the impact of data drift in production by studying two real-world deployments in a leading cloud provider. Our study shows that, despite frequent model retraining, these deployed models experience major accuracy drops (up to 40%) and high accuracy variation, which lead to a significant increase in operational costs. Existing solutions to the data drift problem are not designed for large-scale deployments, which need to address real-world issues such as scalability, ground truth latency, and mixed types of data drift. We propose Matchmaker, the first scalable, adaptive, and flexible solution to the data drift problem in large-scale production systems. For each test point, Matchmaker finds the most similar training data batch and uses the corresponding ML model for inference. As part of Matchmaker, we introduce a novel similarity metric that addresses multiple types of data drift while incurring only limited overhead. Experiments on our two real-world ML deployments show that Matchmaker significantly improves model accuracy (up to 14% and 2%, respectively), which saves 18% and 1% in operational costs. At the same time, Matchmaker provides 8x and 4x faster predictions than AUE, a state-of-the-art ML data drift solution.
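The per-point model selection described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. The class name NearestBatchRouter and its methods are hypothetical, and the similarity measure used here (Euclidean distance from the test point to each batch's mean feature vector) is only a stand-in for Matchmaker's own similarity metric, which the abstract does not specify.

```python
import numpy as np


class NearestBatchRouter:
    """Hypothetical sketch: send each test point to the model trained on
    the most similar training batch.

    Assumption: similarity is plain Euclidean distance to the batch mean.
    This is a placeholder, not the metric proposed in the paper.
    """

    def __init__(self):
        self.centroids = []  # one mean feature vector per training batch
        self.models = []     # the model trained on the matching batch

    def add_batch(self, X, model):
        """Register a training batch X (n_samples x n_features) and its model."""
        self.centroids.append(np.asarray(X, dtype=float).mean(axis=0))
        self.models.append(model)

    def select(self, x):
        """Return the index of the batch whose centroid is closest to x."""
        x = np.asarray(x, dtype=float)
        distances = [np.linalg.norm(x - c) for c in self.centroids]
        return int(np.argmin(distances))

    def predict(self, x):
        """Run inference with the model of the selected batch.

        Assumption: each model is a callable that takes one feature vector.
        """
        return self.models[self.select(x)](x)
```

In this sketch, a model is registered with each training batch through add_batch, and predict then picks a model per test point instead of always using the most recently trained one.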

