PARALLEL DATA LAB 

PDL Abstract

Scale and Concurrency of GIGA+: File System Directories with Millions of Files

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-110, September 2010.

Swapnil Patil, Garth A. Gibson

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

http://www.pdl.cmu.edu/

We examine the problem of scalable file system directories, motivated by data-intensive applications requiring millions to billions of small files to be ingested in a single directory at rates of hundreds of thousands of file creates every second. We introduce a POSIX-compliant scalable directory design, Giga+, that distributes directory entries over a cluster of server nodes that make only local, independent decisions about migration. Giga+ uses two tenets, asynchrony and inconsistency, to: (1) partition the index among all servers without any synchronization or serialization, and (2) minimize stale and inconsistent mapping state at the clients. Applications are provided traditional strong data consistency semantics, and cluster growth requires minimal directory entry migration. We have built and demonstrated that the Giga+ approach scales better than existing distributed directory implementations, delivers a sustained throughput of more than 98,000 file creates per second on a 32-server cluster, and balances load more eciently than consistent hashing.

FULL TR: pdf