PARALLEL DATA LAB 

PDL Abstract

Online Deduplication for Distributed Databases

Ph.D. Dissertation, Carnegie Mellon University, Electrical and Computer Engineering, September 2016.

Lianghong Xu

Carnegie Mellon University

http://www.pdl.cmu.edu/

The rate of data growth outpaces the decline of hardware costs, and there has been an ever-increasing demand for reducing the storage and network overhead of online database management systems (DBMSs). The most widely used approach for data reduction in DBMSs is block-level compression. Although this method is simple and effective, it fails to address redundancy across blocks and therefore leaves significant room for improvement for many applications.

This dissertation proposes a systematic approach, termed similarity-based deduplication, which reduces the amount of data stored on disk and transmitted over the network beyond the benefits provided by traditional compression schemes. To demonstrate the approach, we designed and implemented dbDedup, a lightweight record-level similarity-based deduplication engine for online DBMSs. The design of dbDedup exploits key observations we find in database workloads, including small item sizes, temporal locality, and the incremental nature of record updates. The proposed approach differs from traditional chunk-based deduplication approaches in that, instead of finding identical chunks anywhere else in the data corpus, similarity-based deduplication identifies a single similar data item and performs differential compression to remove the redundant parts for greater savings.
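
The general idea of similarity-based deduplication described above can be illustrated with a minimal sketch. This is not dbDedup's implementation: the names (sketch, pick_similar, delta_encode, delta_decode), the WINDOW and TOP_K values, the min-hash-style feature selection, and the use of Python's difflib for the delta step are all assumptions chosen for illustration. The sketch summarizes each record by a few shingle hashes, picks the single stored record that shares the most of them, and stores only the differences against that record.

```python
import hashlib
from difflib import SequenceMatcher

WINDOW = 8  # bytes per shingle (illustrative value, not from the thesis)
TOP_K = 4   # features kept per record (illustrative value)


def sketch(data: bytes) -> frozenset:
    """Summarize a record by the TOP_K smallest hashes of its shingles."""
    hashes = {
        hashlib.blake2b(data[i:i + WINDOW], digest_size=8).digest()
        for i in range(max(1, len(data) - WINDOW + 1))
    }
    return frozenset(sorted(hashes)[:TOP_K])


def pick_similar(new_sketch, index):
    """Return the id of the one stored record sharing the most features.

    `index` maps record id -> sketch. Returns None if nothing overlaps.
    """
    best_id, best_score = None, 0
    for rec_id, rec_sketch in index.items():
        score = len(new_sketch & rec_sketch)
        if score > best_score:
            best_id, best_score = rec_id, score
    return best_id


def delta_encode(base: bytes, target: bytes) -> list:
    """Encode target as copy-from-base and insert-literal operations."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # reuse bytes from base
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))  # new bytes only
        # a pure "delete" needs no operation: those base bytes are skipped
    return ops


def delta_decode(base: bytes, ops: list) -> bytes:
    """Rebuild the target record from the base record and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)
```

In use, a new record version would be sketched, matched against the index of stored sketches, and delta-encoded against the chosen record; only the insert operations carry new bytes. Unlike chunk-based deduplication, no exact chunk match is required: one sufficiently similar record is enough.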

To achieve high efficiency, dbDedup introduces novel encoding, caching and similarity selection techniques that significantly mitigate the deduplication overhead with minimal loss of compression ratio. For evaluation, we integrated dbDedup into the storage and replication components of a distributed NoSQL DBMS and analyzed its properties using four real datasets. Our results show that dbDedup achieves up to 37X reduction in the storage size and replication traffic of the database on its own and up to 61X reduction when paired with the DBMS’s block-level compression. dbDedup provides both benefits with negligible effect on DBMS throughput or client latency (average and tail).

KEYWORDS: deduplication, databases, delta compression
