GIGA+: Scalable Directories for Shared File Systems
(or, How to Build Directories with Trillions of Files)

Contact: Garth Gibson

Overview

Traditionally, file system designs have envisioned directories as a means of organizing files for human viewing; that is, directories typically contain a few tens to thousands of entries. Users of large, fast file systems, however, have begun to put millions of entries in a single directory, probably using them as simple databases. Furthermore, many large-scale applications burstily create one file per compute core in clusters with tens to hundreds of thousands of cores.

This project is about how to build file system directories that contain billions to trillions of entries and can grow instantly, with all cores contributing concurrently. The central tenet of our work is extreme scalability through high parallelism. We realize this principle by eliminating serialization, minimizing system-wide synchronization, and using weak consistency semantics for the mapping information (applications and users still see a consistent view of the directory contents). We build a distributed directory index, called GIGA+, that uses a unique, self-describing bitmap representation that allows each server to encode all of its local state compactly and provides clients with the hints required to address the correct server. GIGA+ also handles operational realities such as client and server failures, addition and removal of servers, and "request storms" that overload any server.
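
To make the mapping concrete, the sketch below (in C) shows how a client holding a possibly stale copy of the bitmap could address a partition. The layout, names, and the MAX_DEPTH bound are our illustrative assumptions, not the prototype's actual code: we assume bit i of the bitmap is set if and only if partition i exists, and that partition i at depth d splits the upper half of its keyspace into child i + 2^d.

/* A minimal sketch (our assumptions, not the prototype's code) of
 * GIGA+-style client-side addressing: bit i of the bitmap is set iff
 * partition i exists; partition i at depth d splits the upper half of
 * its keyspace into child i + 2^d. */
#include <stdint.h>
#include <stdbool.h>

#define MAX_DEPTH 16                     /* at most 2^16 partitions (assumed) */

typedef struct {
    uint8_t bits[(1u << MAX_DEPTH) / 8]; /* self-describing bitmap */
} giga_bitmap_t;

static bool bitmap_test(const giga_bitmap_t *bm, uint32_t i)
{
    return (bm->bits[i >> 3] >> (i & 7)) & 1;
}

/* Start at the deepest candidate partition for this hash and walk up
 * the split tree (dropping the highest radix bit) until the bitmap
 * says the partition exists; partition 0 always exists. */
uint32_t giga_hash_to_partition(const giga_bitmap_t *bm, uint32_t hash)
{
    int depth = MAX_DEPTH;
    uint32_t index = hash & ((1u << depth) - 1);
    while (depth > 0 && !bitmap_test(bm, index)) {
        depth--;
        index &= (1u << depth) - 1;      /* parent partition in split tree */
    }
    return index;
}

In the GIGA+ design, a server that receives a misaddressed request can use its own state to correct the client's stale mapping, which is why weakly consistent bitmaps suffice and no system-wide synchronization is needed.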

Concurrent and unsynchronized growth in GIGA+ directories
GIGA+ splits a directory into multiple partitions that are distributed across many servers. Each server grows the directory independently by splitting a partition in half into another partition. Each partition is identified by an increasing r-bit radix and stores a range of the keyspace, indicated as {x-y}.
Initially, directories are small and stored in a single partition on one server. As a directory grows, the server splits the partition in half into a new partition on another server (indicated as the right child of a node). The old partition, with a correspondingly smaller keyspace range, remains on the same server (indicated as the left child of the node).
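
The sketch below (again in C, with the same assumed layout and hypothetical names as above) shows what such an independent split could look like: the overflowing partition keeps the lower half of its keyspace, and entries whose next radix bit is set move into the new child partition, with no coordination with any other server.

/* A minimal sketch of an unsynchronized GIGA+-style split, under the
 * same assumed layout as above. A server calls this when a partition
 * grows past its split threshold; no other server is consulted. */
#include <stdint.h>
#include <stddef.h>

typedef struct entry {
    uint32_t hash;               /* hash of the file name */
    struct entry *next;
} entry_t;

typedef struct {
    uint32_t index;              /* unique partition id */
    int depth;                   /* radix bits consumed so far */
    entry_t *entries;            /* directory entries in this partition */
    unsigned count;
} partition_t;

/* Split p in half: entries whose hash has bit p->depth set move to the
 * new child partition (id p->index + 2^depth, placed on another server);
 * the rest stay. Both partitions end up one radix bit deeper, each
 * covering half of the parent's old keyspace range. */
void giga_split(partition_t *p, partition_t *child)
{
    child->index = p->index + (1u << p->depth);
    child->depth = p->depth + 1;
    child->entries = NULL;
    child->count = 0;

    entry_t **cur = &p->entries;
    while (*cur) {
        entry_t *e = *cur;
        if ((e->hash >> p->depth) & 1) {   /* upper half of the range */
            *cur = e->next;                /* unlink from parent */
            e->next = child->entries;      /* push onto child */
            child->entries = e;
            child->count++;
            p->count--;
        } else {
            cur = &e->next;
        }
    }
    p->depth++;                  /* parent keeps the lower half */
}

After a split, the server sets the child's bit in its local bitmap; clients learn of the new partition lazily, through the hints described above.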


We are building two prototypes of our system. In the first prototype, we have added support for huge directories to an open-source cluster file system, the Parallel Virtual File System (PVFS2), which is used in production on large HPC clusters. The second prototype uses the FUSE (File System in Userspace) API to build a user-level implementation of the GIGA+ indexing technique. The goal of this prototype is to develop a "library"-like platform for experimenting with ideas for building scalable metadata services, and for making those services portable across underlying file systems, such as NFS/pNFS and cluster file systems that lack support for distributed directories.

Code Releases

We plan to release the source code of our PVFS2 and FUSE prototypes.


Publications


Acknowledgements

This material is based upon research supported by the DOE Office of Advanced Scientific Computing Research (ASCR) program for Scientific Discovery through Advanced Computing (SciDAC) under Award Number DE-FC02-06ER25767, in the Petascale Data Storage Institute (PDSI). It is also supported by Los Alamos National Laboratory under Award Number 54515-001-07 (the CMU/LANL IRHPIT initiative).

We thank the members and companies of the PDL Consortium: American Power Conversion, Data Domain, Inc., EMC Corporation, Facebook, Google, Hewlett-Packard Labs, Hitachi, IBM, Intel Corporation, LSI, Microsoft Research, NetApp, Inc., Oracle Corporation, Seagate Technology, Sun Microsystems, Symantec Corporation and VMware, Inc. for their interest, insights, feedback, and support.