A Case for Packing and Indexing in Cloud File Systems

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-17-105, October 2017.

Saurabh Kadekodi, Bin Fan*, Adit Madan*, Garth A. Gibson

Carnegie Mellon University
* Alluxio Inc.

http://www.pdl.cmu.edu/

The amount of data written to a storage object (its write size) affects many aspects of cost and performance. Ideal write sizes for cloud storage systems can differ radically from the write, or file, sizes of a particular application. For applications that create a large number of small files, creating one backing-store object per small file can lead to prohibitively slow write performance and can also be cost-ineffective under the current cloud storage pricing model.

This paper proposes a packing, or bundling, layer close to the application that transparently transforms arbitrary user workloads into a write pattern better suited to cloud storage. Implemented as a distributed write-only cache, packing coalesces small files (a few megabytes or smaller) into gigabyte-sized blobs for efficient batched transfers to cloud backing stores. The benefits in price / cost can be even larger than the benefits in performance.
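
The packing and indexing idea described above can be illustrated with a minimal single-process sketch. This is not the Alluxio implementation: the class name Packer, the blob_limit parameter, the read helper, and the in-memory list standing in for the cloud backing store are all hypothetical names chosen for illustration. The sketch buffers small files, records each file's (offset, length) in an index, and emits one blob per flush so that many small files cost a single backing-store write.

```python
import io
from dataclasses import dataclass, field


@dataclass
class Packer:
    """Hypothetical packer: buffers small files, emits one blob + index per flush."""
    blob_limit: int  # assumed flush threshold in bytes (gigabytes in practice)
    _buf: io.BytesIO = field(default_factory=io.BytesIO)
    _index: dict = field(default_factory=dict)  # file name -> (offset, length)
    blobs: list = field(default_factory=list)   # stand-in for the backing store

    def write(self, name: str, data: bytes) -> None:
        # Flush first if this file would push the current blob past the limit.
        if self._index and self._buf.tell() + len(data) > self.blob_limit:
            self.flush()
        # Record where the file lives inside the blob, then append its bytes.
        self._index[name] = (self._buf.tell(), len(data))
        self._buf.write(data)

    def flush(self) -> None:
        if not self._index:
            return
        # One backing-store write covers every file packed so far.
        self.blobs.append((self._buf.getvalue(), dict(self._index)))
        self._buf = io.BytesIO()
        self._index = {}


def read(blobs, name: str) -> bytes:
    """Look a file up through the per-blob index and slice it out of the blob."""
    for blob, index in blobs:
        if name in index:
            offset, length = index[name]
            return blob[offset:offset + length]
    raise KeyError(name)


if __name__ == "__main__":
    packer = Packer(blob_limit=10)  # tiny limit so the example flushes
    packer.write("a", b"12345")
    packer.write("b", b"6789")
    packer.write("c", b"abcdef")    # would exceed the limit, so a new blob starts
    packer.flush()
    print(len(packer.blobs), read(packer.blobs, "b"))
```

In this sketch three file creates turn into two backing-store writes; with a gigabyte-scale limit and files of a few megabytes or less, each blob would absorb hundreds of files or more. A real deployment would also need to persist the index and handle deletes and concurrent writers, which this sketch leaves out.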

Our packing optimization, implemented in Alluxio (an open-source distributed file system), resulted in a >25000x reduction in data ingest cost for a small-file create workload and a >61x reduction in end-to-end experiment runtime.

KEYWORDS: distributed file systems, cloud storage, packing, indexing