PARALLEL DATA LAB 

PDL Abstract

Streaming Data Reorganization at Scale with DeltaFS Indexed Massive Directories

ACM Transactions on Storage, Vol. 16, No. 4, Article 23. September 2020.

Qing Zheng, Charles D. Cranor, Ankush Jain, Gregory R. Ganger, Garth A. Gibson, George Amvrosiadis,
Bradley W. Settlemyer*, Gary Grider*

Carnegie Mellon University
* Los Alamos National Laboratory

http://www.pdl.cmu.edu/

Complex storage stacks providing data compression, indexing, and analytics help leverage the massive amounts of data generated today to derive insights. It is challenging to perform this computation, however, while fully utilizing the underlying storage media. This is because, while storage servers with large core counts are widely available, single-core performance and memory bandwidth per core grow slower than the core count per die. Computational storage offers a promising solution to this problem by utilizing dedicated compute resources along the storage processing path.

We present DeltaFS Indexed Massive Directories (IMDs), a new approach to computational storage. DeltaFS IMDs harvest available (i.e., not dedicated) compute, memory, and network resources on the compute nodes of an application to perform computation on data. We demonstrate the efficiency of DeltaFS IMDs by using them to dynamically reorganize the output of a real-world simulation application across 131,072 CPU cores. DeltaFS IMDs speed up reads by 1,740× while only slightly slowing down the writing of data during simulation I/O for in situ data processing.

FULL PAPER: pdf