PARALLEL DATA LAB 

PDL Abstract

Reducing Replication Bandwidth for Distributed Document Databases

ACM Symposium on Cloud Computing 2015. Aug. 27 - 29, 2015, Kohala Coast, HI.

Lianghong Xu, Andrew Pavlo, Sudipta Sengupta†, Jin Li†, Gregory R. Ganger

Carnegie Mellon University
Pittsburgh, PA 15213

† Microsoft Research

http://www.pdl.cmu.edu/

With the rise of large-scale, Web-based applications, users are increasingly adopting a new class of document-oriented database management systems (DBMSs) that allow for rapid prototyping while also achieving scalable performance. As with other distributed storage systems, replication is important for document DBMSs in order to guarantee availability. The network bandwidth required to keep replicas synchronized is expensive and is often a performance bottleneck. As such, there is a strong need to reduce the replication bandwidth, especially for geo-replication scenarios where wide-area network (WAN) bandwidth is limited.

This paper presents a deduplication system called sDedup that reduces the amount of data transferred over the network for replicated document DBMSs. sDedup uses similarity-based deduplication to remove redundancy in replication data by delta encoding against similar documents selected from the entire database. It exploits key characteristics of document-oriented workloads, including small item sizes, temporal locality, and the incremental nature of document edits. Our experimental evaluation of sDedup with three real-world datasets shows that it is able to achieve up to 38× reduction in data sent over the network, significantly outperforming traditional chunk-based deduplication techniques while incurring negligible performance overhead.
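
To make the idea of similarity-based deduplication with delta encoding more concrete, the following is a minimal Python sketch. It is not the sDedup implementation. It assumes a simple bottom-k hash sketch over sliding byte windows as the similarity feature and uses Python's difflib to compute the delta. All names (Store, sketch, make_replication_record, apply_record) and parameters (window, k) are hypothetical and chosen only for illustration.

```python
import hashlib
from difflib import SequenceMatcher


def sketch(doc: bytes, window: int = 8, k: int = 8) -> frozenset:
    """Similarity features: the k smallest hashes over sliding byte windows.
    (An assumed feature scheme, not necessarily the one sDedup uses.)"""
    grams = {doc[i:i + window] for i in range(max(1, len(doc) - window + 1))}
    hashes = sorted(int.from_bytes(hashlib.sha1(g).digest()[:8], "big") for g in grams)
    return frozenset(hashes[:k])


def delta_encode(base: bytes, target: bytes) -> list:
    """Describe target as copies from base plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" regions of base are simply never copied
    return ops


def delta_decode(base: bytes, ops: list) -> bytes:
    """Rebuild the target document from base and the delta operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


class Store:
    """Toy document store with a feature index (feature -> latest doc id)."""

    def __init__(self):
        self.docs = {}
        self.index = {}

    def find_similar(self, doc: bytes):
        """Return the id of the stored document sharing the most features."""
        votes = {}
        for feature in sketch(doc):
            doc_id = self.index.get(feature)
            if doc_id is not None:
                votes[doc_id] = votes.get(doc_id, 0) + 1
        return max(votes, key=votes.get) if votes else None

    def add(self, doc_id, doc: bytes):
        self.docs[doc_id] = doc
        for feature in sketch(doc):
            self.index[feature] = doc_id


def make_replication_record(primary: Store, doc_id, doc: bytes) -> tuple:
    """On the primary: send a delta if a similar document exists, else the full document."""
    base_id = primary.find_similar(doc)
    base = primary.docs[base_id] if base_id is not None else None
    primary.add(doc_id, doc)
    if base is None:
        return ("full", doc_id, doc)
    return ("delta", doc_id, base_id, delta_encode(base, doc))


def apply_record(replica: Store, record: tuple) -> None:
    """On the replica: reconstruct the document and store it."""
    if record[0] == "full":
        _, doc_id, doc = record
    else:
        _, doc_id, base_id, ops = record
        doc = delta_decode(replica.docs[base_id], ops)
    replica.add(doc_id, doc)


if __name__ == "__main__":
    primary, replica = Store(), Store()
    v1 = b"The quick brown fox jumps over the lazy dog near the river bank."
    v2 = v1 + b" v2"  # an incremental edit of the first document
    for doc_id, doc in [(1, v1), (2, v2)]:
        record = make_replication_record(primary, doc_id, doc)
        apply_record(replica, record)
        print(doc_id, record[0])
    assert replica.docs == primary.docs
```

In this sketch, the first document is shipped in full because the index is empty, while the edited version is shipped as a short list of copy and insert operations against the first. The replica can decode the delta only because it already holds the same base document, which is why the sketch keeps both stores in step.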
