Over the years, improvements in storage and network hardware, as well as systems software, have been instrumental in mitigating the effect of I/O bottlenecks in HPC applications. Still, many scientific applications that read and write data in small chunks are limited by the ability of both the hardware and the software to handle such workloads efficiently. This problem will be exacerbated with exascale computers, which will allow scientific applications to run simulations at significantly larger scales. Even worse, scientific data is typically persisted out of order, creating the need to budget time and resources for a costly, massive sorting operation.
DeltaFS is a new distributed file system service that addresses these issues. To deal with the immense metadata load resulting from handling a large number of files, DeltaFS is designed to be transient and software-defined. The transient property allows each application using DeltaFS to individually start, stop, and control the amount of computing resources dedicated to the file system by bootstrapping its own DeltaFS service across as many nodes as it needs, effectively controlling metadata performance. When combined with DeltaFS’s software-defined nature, this allows file system design and provisioning decisions to be decoupled from the overall design of HPC platforms. Our experiments (Figure 1) show that these properties allow DeltaFS to process two orders of magnitude more file creation operations than prior approaches.
Figure 1: File creation throughput with DeltaFS and prior work (IndexFS)
Another important goal of DeltaFS is to guarantee both fast writing and fast reading for workloads consisting of small I/O transfers. A popular scientific application that would benefit from this is Vector-Particle-in-Cell (VPIC). VPIC simulates the movement of individual particles through their interactions, and can output particle data for each simulated time step out of order. Scientists, however, are mostly interested in accessing all data for a given particle, e.g., to study its trajectory. To improve the performance of such applications, DeltaFS uses Indexed Massive Directories (IMDs). Within this special directory type, out-of-order writes to a massive number of files are indexed in-situ as data is written to storage. This indexing enables quick data accesses without the need to post-process the data. Our VPIC experiments on the LANL Trinity supercomputer show a 5,000x speedup for particle trajectory accesses through DeltaFS when 99 compute nodes are used. We expect further improvement as simulation size (i.e., output size) increases.
Figure 2: Performance of particle trajectory queries under DeltaFS
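The idea of indexing out-of-order writes as they happen can be illustrated with a small, self-contained sketch. This is not DeltaFS code and does not reflect its API or on-disk format: the `IndexedLog` class, the record layout, and the in-memory dictionary index are all hypothetical, chosen only to show why a write-time index removes the need for a later sorting pass.

```python
import io
import struct
from collections import defaultdict

# Hypothetical fixed-size record: particle id, time step, x, y, z.
RECORD = struct.Struct("<QIfff")


class IndexedLog:
    """Toy append-only log that indexes records by particle id at write time."""

    def __init__(self):
        self.log = io.BytesIO()          # stands in for persistent storage
        self.index = defaultdict(list)   # particle id -> byte offsets of its records

    def append(self, pid, step, x, y, z):
        offset = self.log.seek(0, io.SEEK_END)
        self.log.write(RECORD.pack(pid, step, x, y, z))
        self.index[pid].append(offset)   # index is updated in-situ, with the write

    def trajectory(self, pid):
        """Return (step, x, y, z) tuples for one particle, ordered by time step."""
        points = []
        for offset in self.index.get(pid, []):
            self.log.seek(offset)
            _, step, x, y, z = RECORD.unpack(self.log.read(RECORD.size))
            points.append((step, x, y, z))
        return sorted(points)


log = IndexedLog()
# Records arrive interleaved across particles and out of time-step order.
for pid, step in [(7, 2), (3, 0), (7, 0), (3, 1), (7, 1)]:
    log.append(pid, step, float(step), 0.0, 0.0)

print(log.trajectory(7))
```

A trajectory query here reads only the records belonging to the requested particle instead of scanning or sorting the whole log. A real system would also have to persist and partition the index across nodes, which this sketch leaves out.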
Los Alamos National Laboratory: Gary Grider, Brad Settlemyer, Fan Guo
Argonne National Laboratory: Robert B. Ross, Philip Carns, Matthieu Dorier, Robert Latham
The HDF Group: Jerome Soumagne
We thank the members and companies of the PDL Consortium: Amazon, Google, Hitachi Ltd., Honda, Intel Corporation, IBM, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.