PARALLEL DATA LAB 

PDL Abstract

Reducing Replication Bandwidth for Distributed Document Databases

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-108. December 2014.

Lianghong Xu, Andrew Pavlo, Sudipta Sengupta*, Jin Li*, Gregory R. Ganger

Carnegie Mellon University
*Microsoft Research

http://www.pdl.cmu.edu/

With the rise of large-scale, Web-based applications, users are increasingly adopting a new class of document-oriented database management systems (DBMSs) that allow for rapid prototyping while also achieving scalable performance. As with other distributed storage systems, replication is an important consideration for document DBMSs in order to guarantee availability. Replication can occur between failure-independent nodes in the same data center, between geographically diverse data centers, or both. A replicated DBMS maintains synchronization across multiple nodes by sending operation logs (oplogs) across the network, and the network bandwidth required can become a bottleneck. As such, there is a strong need to reduce the bandwidth required to maintain secondary database replicas, especially for geo-replication scenarios where wide-area network (WAN) bandwidth is expensive and capacity grows slowly across infrastructure upgrades.

This paper presents a deduplication system called sDedup that reduces the amount of data transferred over the network for replicated document DBMSs. sDedup uses similarity-based deduplication to remove redundancy in the documents carried by oplog entries, delta encoding each one against a similar document selected from the entire database. It exploits key characteristics of document-oriented workloads, including small document sizes, temporal locality, and the incremental nature of document edits. Our experimental evaluation of sDedup using MongoDB with three real-world datasets shows that it is able to achieve up to 38X reduction in oplog bytes sent over the network, in addition to the standard 3X reduction from compression, significantly outperforming traditional chunk-based deduplication techniques while incurring negligible performance overhead.
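
The general idea of similarity-based deduplication with delta encoding can be illustrated with a minimal Python sketch. This is not sDedup's implementation: the fixed-size shingle hashing, the top-K feature sketch, the in-memory `FeatureIndex`, and the use of `difflib` as the delta encoder are all stand-ins chosen for brevity, and every name and constant below is hypothetical.

```python
import hashlib
from difflib import SequenceMatcher

SHINGLE = 8  # bytes per shingle (illustrative value, not from the report)
TOP_K = 4    # features kept per document (illustrative value)


def sketch(doc):
    """Return the TOP_K largest shingle hashes as a similarity sketch."""
    hashes = {
        int.from_bytes(hashlib.md5(doc[i:i + SHINGLE]).digest()[:8], "big")
        for i in range(max(1, len(doc) - SHINGLE + 1))
    }
    return set(sorted(hashes, reverse=True)[:TOP_K])


class FeatureIndex:
    """Hypothetical in-memory index mapping sketch features to document ids."""

    def __init__(self):
        self.docs = {}        # doc_id -> document bytes
        self.by_feature = {}  # feature hash -> list of doc ids

    def add(self, doc_id, doc):
        self.docs[doc_id] = doc
        for feature in sketch(doc):
            self.by_feature.setdefault(feature, []).append(doc_id)

    def most_similar(self, doc):
        """Return the id of the stored document sharing the most features, or None."""
        votes = {}
        for feature in sketch(doc):
            for doc_id in self.by_feature.get(feature, []):
                votes[doc_id] = votes.get(doc_id, 0) + 1
        return max(votes, key=votes.get) if votes else None


def delta_encode(source, target):
    """Encode target as copy ranges from source plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif j2 > j1:  # "replace" or "insert": ship the new bytes literally
            ops.append(("insert", target[j1:j2]))
    return ops


def delta_decode(source, ops):
    """Rebuild the target from the source document and the delta operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += source[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)
```

In this sketch, a primary would look up `most_similar` for each new document, send only the chosen source id and the `delta_encode` output, and a secondary holding the same source document would call `delta_decode` to rebuild it. When no similar document is found, the full document would have to be sent instead.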

KEYWORDS: Deduplication, Geo-replication, Document-oriented Databases

FULL TR: pdf