Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-19-104, June 2019.
Qing Zheng, Charles D. Cranor, Ankush Jain, Gregory R. Ganger, Garth A. Gibson, George Amvrosiadis, Bradley W. Settlemyer†, Gary A. Grider†
Carnegie Mellon University
† Los Alamos National Laboratory
We are approaching a point in time when it will be infeasible to catalog and query data after it has been generated. This trend has fueled research on in-situ data processing (i.e., operating on data as it is streamed to storage). One important example of this approach is in-situ data indexing. Prior work has shown the feasibility of indexing at scale as a two-step process: first partitioning data by key across CPU cores, and then having each core produce indexes on its subset as the data is persisted. Online partitioning requires that data be shuffled over the network so that it can be indexed and stored by the responsible core. This is becoming more costly as new processors emphasize parallelism over the individual core performance that is crucial for processing network events. Beyond indexing, scalable online data partitioning is also useful in other contexts, such as efficient compression and load balancing.
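To make the two-step process concrete, the following is a minimal sketch of hash-based online partitioning followed by per-core indexing. The partitioning function, the in-memory index, and names such as partition_key and CoreIndexer are illustrative assumptions, not details from the report.

```python
# Sketch: step 1 routes each KV pair to a responsible core by hashing its key;
# step 2 has that core index the pair as it persists it locally.
import hashlib

NUM_CORES = 4  # number of CPU cores participating in the shuffle (assumed)

def partition_key(key: bytes, num_cores: int = NUM_CORES) -> int:
    """Step 1: map a key to the core responsible for indexing it."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "little") % num_cores

class CoreIndexer:
    """Step 2: each core indexes and persists the KV pairs routed to it."""
    def __init__(self, core_id: int):
        self.core_id = core_id
        self.index = {}             # key -> offset of the value in local storage
        self.storage = bytearray()  # stand-in for the core's output file

    def ingest(self, key: bytes, value: bytes) -> None:
        self.index[key] = len(self.storage)
        self.storage += value

# Simulated shuffle: every pair travels (conceptually over the network)
# to the core chosen by the partitioner, which then indexes it locally.
cores = [CoreIndexer(i) for i in range(NUM_CORES)]
for key, value in [(b"particle-17", b"payload-A"), (b"particle-42", b"payload-B")]:
    cores[partition_key(key)].ingest(key, value)
```

Note that in this baseline the full KV pair crosses the network, which is exactly the cost that grows as per-core performance stagnates.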
We present FilterKV, a data management scheme for faster online partitioning of key-value (KV) pair data. FilterKV reduces the amount of data shuffled over the network by (a) moving KV pairs quickly off the network to storage, and (b) using an extremely compact representation for each KV pair in the communication that does occur over the network. We demonstrate FilterKV on the LANL Trinity cluster and show that it can reduce total write time (including partitioning overhead) by 1.9-3.0x across 4096 processor cores.
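The sketch below illustrates the general idea stated in the abstract: the full pair is written to local storage immediately, and only a small per-pair record is shuffled to the core owning that key's partition. The wire format shown here (a key fingerprint plus the pair's local location) is an assumption for illustration; the report defines FilterKV's actual representation.

```python
# Hedged sketch: persist the pair locally, then ship only a compact summary
# over the network to the partition owner. Names and fields are assumed.
import hashlib
from dataclasses import dataclass

NUM_CORES = 4

@dataclass(frozen=True)
class CompactRecord:
    """What travels over the network instead of the full KV pair (assumed)."""
    fingerprint: int   # short hash of the key
    writer_core: int   # core that holds the full pair locally
    offset: int        # location of the pair in that core's local log

def fingerprint(key: bytes) -> int:
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "little")

def write_and_summarize(core_id: int, local_log: bytearray,
                        key: bytes, value: bytes) -> CompactRecord:
    """Move the pair off the network path: persist locally, ship a summary."""
    offset = len(local_log)
    local_log += key + b"\x00" + value
    return CompactRecord(fingerprint(key), core_id, offset)

# The partition owner receives compact records rather than full pairs and can
# build its index from them, resolving full data from the writer's log later.
local_log = bytearray()
rec = write_and_summarize(0, local_log, b"particle-17", b"payload-A")
owner = rec.fingerprint % NUM_CORES   # core that indexes this key's partition
```

Compared with the baseline sketch above, only a few bytes per pair cross the network, which is the source of the reduced shuffle cost the abstract reports.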
KEYWORDS: In-situ data processing, data partitioning
FULL TR: pdf