Parallel Data Laboratory

PDL Abstract

Tastes Great! Less Filling!
High Performance and Accurate Training Data Collection for Self-Driving Database Management Systems

SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA.

Matthew Butrovich, Wan Shen Lim, Lin Ma, John Rollinson*, William Zhang, Yu Xia^, Andrew Pavlo

Carnegie Mellon University
* Army Cyber Institute
^ Massachusetts Institute of Technology

http://www.pdl.cmu.edu/

A self-driving database management system (DBMS) aims to configure, deploy, and optimize almost all aspects of itself automatically without human intervention or guidance. Achieving this high level of automation relies on machine learning (ML) models that predict how a DBMS will behave in different scenarios. This behavior encompasses all DBMS runtime operations, including query execution and maintenance tasks. These ML-based behavior models for a self-driving DBMS require low-level training data about a DBMS’s internals. Such training data includes (1) features that describe the workload, environment, and DBMS configuration, and (2) both DBMS- and hardware-level metrics. However, collecting training data from a running DBMS is difficult because the collection itself can degrade both the system’s performance and the accuracy of the measurements, which hinders the ML models’ ability to predict the DBMS’s behavior correctly.

We present the TScout (TS) framework for collecting training data from self-driving DBMSs. Our framework is an internal approach where developers annotate a DBMS’s source code with hooks to monitor the system’s behavior. TS then extracts these hooks and generates a kernel-level program (via Linux’s BPF) that efficiently captures metrics from multiple sources (e.g., CPU performance counters, memory allocators). TS combines these metrics with internal DBMS state observations, generating training data for behavior models. We integrated TS into a PostgreSQL-compatible DBMS and measured its ability to collect training data for both OLTP and OLAP workloads. Our results show that the training data TS collects from a deployed DBMS yields more accurate models than previous methods, with only a 7% performance reduction.
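The hook idea can be illustrated with a minimal user-space sketch. This is not TScout's implementation: TS generates a kernel-level BPF program, whereas the code below only mimics the shape of the approach in plain Python. The names `ts_hook`, `TRAINING_DATA`, and `seq_scan` are hypothetical, and wall-clock time, process CPU time, and traced allocations stand in for the hardware- and allocator-level metrics the abstract mentions. Each hooked region produces one training row that pairs the features supplied by the DBMS code with the metrics measured around it.

```python
import time
import tracemalloc
from contextlib import contextmanager

# Hypothetical in-memory sink for training rows (an assumption for this sketch).
TRAINING_DATA = []


@contextmanager
def ts_hook(hook_name, features):
    """Measure one annotated region and emit a (features, metrics) row.

    Hypothetical helper: it only imitates what an annotated hook might
    record, using standard-library counters instead of BPF probes.
    """
    was_tracing = tracemalloc.is_tracing()
    if not was_tracing:
        tracemalloc.start()
    mem_before, _ = tracemalloc.get_traced_memory()
    cpu_before = time.process_time_ns()
    wall_before = time.perf_counter_ns()
    try:
        yield
    finally:
        wall_after = time.perf_counter_ns()
        cpu_after = time.process_time_ns()
        mem_after, _ = tracemalloc.get_traced_memory()
        if not was_tracing:
            tracemalloc.stop()
        # Combine internal state (features) with the measured metrics.
        TRAINING_DATA.append({
            "hook": hook_name,
            "features": dict(features),
            "metrics": {
                "elapsed_ns": wall_after - wall_before,
                "cpu_ns": cpu_after - cpu_before,
                "allocated_bytes": mem_after - mem_before,
            },
        })


def seq_scan(table, predicate):
    """Toy operator annotated with a hook; the feature set is illustrative."""
    features = {"num_rows": len(table)}
    with ts_hook("seq_scan", features):
        return [row for row in table if predicate(row)]


# Usage: run the toy operator once and inspect the collected row.
even_rows = seq_scan(list(range(1000)), lambda row: row % 2 == 0)
print(TRAINING_DATA[-1])
```

Measuring in the same process, as this sketch does, adds overhead to the very code being measured, which is the kind of performance and measurement degradation that motivates a lower-overhead collection mechanism.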

KEYWORDS: Database Systems, Training Data, Modeling, Metrics, BPF, Butrovich!
