PARALLEL DATA LAB

pNFS

Problem Statement: Abstract

Parallel NFS (pNFS) is part of the NFSv4.1 standard that allows compute clients to access storage devices directly and in parallel. The pNFS architecture eliminates the scalability and performance bottlenecks of typical NFS servers deployed today. It achieves this by separating data from metadata and moving the metadata server out of the data path, creating systems well-suited to data-intensive HPC and AI applications.
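The data/metadata split above can be sketched in a few lines of Python. This is illustrative only: the layout structure, server names, and striping scheme here are assumptions for the sketch, not the NFSv4.1 wire protocol.

```python
# Sketch of the pNFS idea: one metadata round trip to fetch a "layout",
# then parallel data reads that bypass the metadata server entirely.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data servers, each holding one stripe of the file.
DATA_SERVERS = {
    "ds1": b"hello ",
    "ds2": b"parallel ",
    "ds3": b"world",
}

def get_layout(path):
    """Metadata-server role: map each stripe of the file to a data server."""
    return ["ds1", "ds2", "ds3"]

def read_stripe(server):
    """Client reads a stripe directly from a data server, not the MDS."""
    return DATA_SERVERS[server]

def pnfs_read(path):
    layout = get_layout(path)              # one metadata round trip
    with ThreadPoolExecutor() as pool:     # data reads proceed in parallel
        stripes = list(pool.map(read_stripe, layout))
    return b"".join(stripes)
```

Because the metadata server only hands out the layout, adding data servers scales read/write bandwidth without adding load to the metadata path.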

The PDL, in collaboration with Los Alamos National Laboratory, is developing an extensive testing network for pNFS installations and research. Please see www.pnfs-storage.org for more information on clusters, testbeds, specs, and benchmarks.

 

TEST ENVIRONMENTS

EMULAB

Emulab is a software platform that manages the nodes of a testbed cluster. It provides Emulab users with full bare-metal access to nodes. This allows researchers to use a wide range of environments in which to develop, debug, and evaluate their systems. The primary Emulab installation is run by the Flux Group, part of the School of Computing at the University of Utah. There are also installations of the Emulab software at more than two dozen sites around the world, ranging from testbeds with a handful of nodes up to testbeds with hundreds of nodes. Emulab is widely used by computer science researchers in the fields of networking and distributed systems. It is also designed to support education and has been used to teach classes in those fields.

MVPNET

MVPNet is an MPI application that allows users to launch a set of qemu-based virtual machines (VMs) as an MPI job. Users are free to choose the guest operating systems to run and have full root access to the guest. Each mvpnet MPI rank runs its own guest VM under qemu, and guest operating systems communicate with each other over an MPI-based virtual private network managed by mvpnet.

Each mvpnet guest VM has a virtual Ethernet interface configured using qemu's -netdev stream or -netdev dgram flags. The qemu program connects this type of virtual Ethernet interface to a Unix domain socket file on the host system. The mvpnet application reads Ethernet frames sent by its guest OS from this socket file, then uses MPI point-to-point operations to forward each frame to the mvpnet rank running the destination guest VM. The destination mvpnet rank delivers the Ethernet frame to its guest VM by writing it to that VM's socket file. To route Ethernet frames, mvpnet uses a fixed mapping between its MPI rank number, the guest VM IP address, and the guest VM Ethernet hardware address. Both IPv4 ARP and Ethernet broadcast operations are supported by mvpnet.
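The fixed rank/IP/MAC mapping described above can be sketched as follows. The actual addressing scheme mvpnet uses is not specified here, so the subnet, MAC prefix, and function names below are assumptions chosen only to show how such a mapping lets a destination MAC select an MPI rank without any dynamic routing state.

```python
# Hypothetical sketch of a fixed rank <-> IP <-> MAC mapping (addresses
# and layout are assumptions, not mvpnet's real scheme).

BASE_NET = "10.0.0."          # assumed guest private subnet
MAC_PREFIX = "52:54:00:00"    # assumed locally administered qemu-style prefix

def ip_for_rank(rank: int) -> str:
    """Guest IP derived directly from the MPI rank (rank 0 -> 10.0.0.1)."""
    return f"{BASE_NET}{rank + 1}"

def mac_for_rank(rank: int) -> str:
    """Guest Ethernet address with the rank encoded in the low two octets."""
    return f"{MAC_PREFIX}:{(rank >> 8) & 0xFF:02x}:{rank & 0xFF:02x}"

def rank_for_mac(mac: str) -> int:
    """Invert the mapping: a destination MAC identifies the target MPI rank."""
    hi, lo = mac.split(":")[-2:]
    return (int(hi, 16) << 8) | int(lo, 16)

def route_frame(frame: bytes) -> int:
    """Pick the MPI rank that should receive this Ethernet frame.
    The first 6 bytes of an Ethernet frame are the destination MAC."""
    dst = ":".join(f"{b:02x}" for b in frame[:6])
    return rank_for_mac(dst)
```

With a mapping like this, forwarding a frame is a pure function of its destination address: the sending rank computes the target rank and issues a single MPI point-to-point send, with no ARP-style lookups between ranks.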

OpenCHAMI

The LANL testbed uses OpenCHAMI (GitHub) to boot and manage its nodes. OpenCHAMI is an open-source, microservice-based system management platform that follows cloud-native principles. Nodes boot images over the network so that images can be managed centrally; these images are SquashFS archives built using OpenCHAMI’s image-builder tool. To keep images simple, post-boot configuration is handled by OpenCHAMI’s cloud-init server, a replacement for Canonical’s upstream cloud-init that organizes post-boot configuration by node group. Images are built in layers, with each new layer built on top of existing ones. This compartmentalizes image changes so that only the affected components of an image need to be rebuilt. Partners can perform tasks in booted images, such as installing packages, with the understanding that these changes are ephemeral and lost on reboot. If something in the image needs to change persistently, partners can request changes to the configuration of their own image layer, which will be rebuilt using the OpenCHAMI image-builder tool, or they can submit their own image to be added to the image repository. If modifying an image is not desired, partners can instead request changes to the cloud-init post-boot configuration for their image.
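Post-boot configuration of the kind described above is typically expressed as cloud-init user-data. The fragment below is a hypothetical example of what a per-group configuration might look like; the package names and command are illustrative, not taken from the LANL testbed.

```yaml
#cloud-config
# Hypothetical user-data for one node group (contents are illustrative only).
# OpenCHAMI's cloud-init server would serve group-specific data like this
# to every node in the group at boot, keeping it out of the image itself.
packages:
  - nfs-utils          # example package installed post-boot
runcmd:
  - systemctl enable --now rpcbind   # example post-boot service setup
```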

 

Publications

  • more recent publications coming soon

  • pNFS Problem Statement. Garth Gibson, Peter Corbett. Internet Draft, July 2004.
    archived copy

  • Parallel NFS Requirements and Design Considerations. G. Gibson, B. Welch, G. Goodson, P. Corbett. Internet Draft, October 18, 2004.
    archived copy

  • pNFS Operations Summary. Brent Welch, Benny Halevy, David Black, Andy Adamson, Dave Noveck. Internet Draft, October 18, 2004.
    archived copy

  • Ongoing development of pNFS takes place in the NFSv4 working group of the IETF.

Code Downloads

RELATED RESEARCH

OPEN SOURCE

PEAK AIO - The first fully open pNFS platform for AI and HPC, scaling from a single node to a super cluster with linearly scalable data acceleration. In collaboration with Los Alamos National Laboratory, it delivers a unified path from NFS silos to Tier 0 without complexity.

PROPRIETARY

Hammerspace - NFSv4.2, using Parallel NFS with Flex Files, provides file access that bridges silos, sites, and clouds, and delivers parallel file system performance without installing third-party clients or management tools and without rewriting applications to use object storage.

Pure Storage - Pure Storage uses Parallel Network File System (pNFS) as a key technology in its FlashBlade//EXA system to deliver high-performance, scalable file services, particularly for demanding AI and High-Performance Computing (HPC) workloads.

Links

Acknowledgements

We thank the members and companies of the PDL Consortium: Bloomberg LP, Datadog, Google, Intel Corporation, Jane Street, LayerZero Labs, Meta, Microsoft Research, Oracle Corporation, Oracle Cloud Infrastructure, Pure Storage, Salesforce, Samsung Semiconductor Inc., Uber, and Western Digital for their interest, insights, feedback, and support.