Internet DRAFT - draft-welch-pnfs-ops

INTERNET-DRAFT                                              Brent Welch
                                                           Panasas Inc.
                                                           Benny Halevy
                                                           Panasas Inc.
                                                            David Black
                                                        EMC Corporation
                                                           Andy Adamson
                                            CITI University of Michigan
                                                            Dave Noveck
                                                      Network Appliance

Document: draft-welch-pnfs-ops-00.txt                      October 2004

Expires: April 2005

pNFS Operations Summary
October 2004

Status of this Memo

  By submitting this Internet-Draft, I certify that any applicable
  patent or other IPR claims of which I am aware have been disclosed,
  or will be disclosed, and any of which I become aware will be
  disclosed, in accordance with RFC 3668.

  Internet-Drafts are working documents of the Internet Engineering
  Task Force (IETF), its areas, and its working groups.  Note
  that other groups may also distribute working documents as
  Internet-Drafts.

  Internet-Drafts are draft documents valid for a maximum of six months
  and may be updated, replaced, or obsoleted by other documents at
  any time.  It is inappropriate to use Internet-Drafts as reference
  material or to cite them other than as "work in progress."

  The list of current Internet-Drafts can be accessed at
  http://www.ietf.org/ietf/1id-abstracts.txt.

  The list of Internet-Draft Shadow Directories can be accessed at
  http://www.ietf.org/shadow.html.

Copyright Notice

  Copyright (C) The Internet Society (2004). All Rights Reserved.

Abstract

  This Internet-Draft provides a description of the pNFS extension
  for NFSv4.

  The key feature of the protocol extension is the ability for clients
  to perform read and write operations that go directly from the
  client to individual storage system elements without funneling
  all such accesses through a single file server.  Of course, the
  file server must coordinate the client I/O so that the file system
  retains its integrity.

welch-pnfs-ops           Expires - April 2005                   [Page 1]
Internet-Draft         pNFS Operations Summary              October 2004

  The extension adds operations that query and manage layout
  information that allows parallel I/O between clients and storage
  system elements.  Layouts are managed similarly to delegations in
  that they have leases and can be recalled by the server, but layout
  information is independent of delegations.

Table of Contents

1. Introduction                                                       3
2. General Definitions                                                3
2.1 Metadata                                                          3
2.2 Storage Device                                                    4
2.3 Storage Protocol                                                  4
2.4 Management Protocol                                               4
2.5 Layout                                                            4
3. Layouts and Aggregation                                            5
4. Security Information                                               6
4.1 Object Storage Security                                           6
4.2 File Security                                                     6
4.3 Block Security                                                    7
5. pNFS Typed data structures                                         7
5.1 pnfs_layoutclass4                                                 7
5.2 pnfs_deviceid4                                                    7
5.3 pnfs_devaddr4                                                     7
5.4 pnfs_devlist_item4                                                8
5.5 pnfs_layouttype4                                                  8
5.6 pnfs_layout4                                                      8
6. pNFS File Attributes                                               9
6.1 pnfs_layoutclass4<> LAYOUT_CLASSES                                9
6.2 pnfs_layouttype4 LAYOUT_TYPE                                      9
6.3 pnfs_layouttype4    LAYOUT_HINT                                   9
7. pNFS Error Definitions                                             9
8. pNFS Operations                                                    9
8.1 LAYOUTGET - Get Layout Information                                9
8.2 LAYOUTCOMMIT - Commit writes made using a layout                 11
8.3 LAYOUTRETURN - Release Layout Information                        13
8.4 GETDEVICEINFO - Get Device Information                           14
8.5 GETDEVICELIST - Get List of Devices                              15
9. Callback Operations                                               16
9.1 CB_LAYOUTRECALL                                                  16
10. Usage Scenarios                                                  17
10.1 Basic Read Scenario                                             17
10.2 Multiple Reads to a File                                        17
10.3 Multiple Reads to a File with Delegations                       17
10.4 Read with existing writers                                      18
10.5 Read with later conflict                                        18
10.6 Basic Write Case                                                18
10.7 Large Write Case                                                19
10.8 Create with special layout                                      19
11. Layouts and Aggregation                                          19
11.1 Simple Map                                                      19
11.2 Block Map                                                       19
11.3 Striped Map (RAID 0)                                            20
11.4 Replicated Map                                                  20


11.5 Concatenated Map                                                20
11.6 Nested Map                                                      20
12. Issues                                                           21
12.1 Storage Protocol Negotiation                                    21
12.2 Crash recovery                                                  21
12.3 Storage Errors                                                  21
13. References                                                       22
14. Acknowledgments                                                  22
15. Author's Addresses                                               22
16. Full Copyright Notice                                            23


1. Introduction

  The pNFS extension to NFSv4 takes the form of new operations that
  return data location information called a "layout".  The layout
  is protected by layout delegations.  When a client has a layout
  delegation, it has rights to access the data directly using
  the location information in the layout.  There are both read and
  write layouts, and they may apply only to a sub-range of the file's
  contents.

  The layout delegations are managed in a similar fashion as NFSv4
  data delegations (e.g., they are recallable and revocable), but they
  are distinct abstractions and are manipulated with new operations
  as described below.  To avoid any confusion between the existing
  NFSv4 data delegations and layout delegations, the term "layout"
  implies "layout delegation".

  There are new attributes that describe general layout
  characteristics.  However, attributes do not provide all we need
  to support layouts, hence the use of operations instead.

  Finally, there are issues about how layout delegations interact
  with the existing NFSv4 abstractions of data delegations and byte
  range locking.  These issues (and more) are also discussed here.

2. General Definitions

  This protocol extension partitions the file system protocol into
  two parts, the control path and the data path.  The control path is
  implemented by the extended (p)NFSv4 file server, while the data
  path may be implemented by direct communication between the file
  system client and the storage devices.  This leads to a few new
  terms used to describe the protocol extension.

2.1 Metadata

  This is information about a file, such as its name, owner, where it
  is stored, and so forth.  The information is managed by the File
  server (sometimes called the metadata manager).  Metadata also
  includes lower-level information such as block addresses and
  indirect block pointers.  Depending on the storage protocol,
  block-level metadata may be managed by the File server, or instead
  by Object Storage Devices or other File servers acting as Storage
  Devices.

2.2 Storage Device

  This is a device, or server, that controls the file's data, but
  leaves other metadata management up to the file server (i.e.,
  metadata manager).  A Storage Device could be another NFS server,
  or an Object Storage Device (OSD) or a block device accessed over a
  SAN (either Fibre Channel or iSCSI).  The goal of this extension
  is to allow direct communication between clients and storage devices.

2.3 Storage Protocol            

  This is the protocol between the client and the storage device
  used to access the file data.  There are three primary types:
  file protocols (such as NFSv4 or NFSv3), object protocols (OSD),
  and block protocols (SCSI-block commands, or "SBC"). These protocols
  are in turn layered over transport protocols such as RPC/TCP/IP or
  iSCSI/TCP/IP or FC/SCSI.  We anticipate there will be variations on
  these storage protocols, including new protocols that are unknown
  at this time or experimental in nature.  The details of the storage
  protocols will be described in other documents so that pNFS clients
  can be written to use these storage protocols.

2.4 Management Protocol 

  This is the protocol between the File server and the Storage devices.
  This protocol is outside the scope of this draft, and is used
  for various management activities that include storage allocation
  and deallocation.  For example, the regular NFSv4 OPEN operation
  is used to create a new file.  This is applied to the File Server,
  which in turn uses the management protocol to allocate storage on
  the storage devices.  The file server returns a layout for the
  new file that the client uses to access the new file directly.
  The management protocol could be entirely private to the File server
  and Storage devices, and need not be published in order to implement
  a pNFS client that uses the associated Storage protocol.

2.5 Layout

  (Also, "map") A layout defines how a file's data is organized on one
  or more storage devices.  There are many possible layout types. They
  vary in the storage protocol used to access the data, and in the
  aggregation scheme that lays out the file data on the underlying
  storage devices.  Layouts are described in more detail below.


3. Layouts and Aggregation

  The layout, or "map", is a typed data structure that has variants
  to handle different storage protocols (block, object, and file).
  A layout describes a range of a file's contents.  For example,
  a block layout might be an array of tuples that store (deviceID,
  block_number, block count) along with information about block size
  and the file offset of the first block.  An object layout is an
  array of tuples (deviceID, objectID) and an additional structure
  (i.e., the aggregation map) that defines how the logical byte
  sequence of the file data is serialized into the different objects.
  A file layout is an array of tuples (deviceID, file_handle), along
  with a similar aggregation map.
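  As a non-normative illustration, the three layout variants described
  above might be modeled as follows.  All names and field layouts here
  are invented for this sketch; the real formats are specified outside
  this document.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative (non-normative) models of the three layout classes.
# Field names are invented; the real formats are defined elsewhere.

@dataclass
class BlockLayout:
    block_size: int
    first_block_file_offset: int
    extents: List[Tuple[int, int, int]]  # (deviceID, block_number, block_count)

@dataclass
class ObjectLayout:
    components: List[Tuple[int, int]]    # (deviceID, objectID)
    stripe_unit: int                     # aggregation map parameter

@dataclass
class FileLayout:
    components: List[Tuple[int, bytes]]  # (deviceID, file_handle)
    stripe_unit: int                     # same style of aggregation map
```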

  The deviceID is a short name for a storage device.  In practice, a
  significant amount of information may be required to fully identify
  a storage device.  Instead of embedding all that information in
  a layout, a level of indirection is used.  Layouts embed device
  Ids, and a new op (GETDEVICEINFO) is used to retrieve the complete
  identity information about the storage device.  For example, the
  identity of a file server or object server could be an IP address
  and port.  The identity of a block device could be a volume label.
  Due to multipath connectivity in a SAN environment, agreement on a
  volume label is considered the reliable way to locate a particular
  storage device.

  Aggregation schemes can describe layouts like simple one-to-one
  mapping, concatenation, and striping.  A general aggregation
  scheme allows nested maps so that more complex layouts can be
  compactly described.  The canonical aggregation type for this
  extension is striping, which allows a client to access storage
  devices in parallel.  Even a one-to-one mapping is useful for
  a file server that wishes to distribute its load among a set of
  other file servers.  There are also experimental aggregation types,
  such as writeable mirrors and RAID, but these are outside the
  scope of this document.
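  For the canonical striping (RAID 0) aggregation, the mapping from a
  file byte offset to a storage device and an offset within that
  device's component can be sketched as follows.  The stripe_unit and
  num_devices parameters are hypothetical layout fields for this
  example, not protocol definitions.

```python
def stripe_locate(offset, stripe_unit, num_devices):
    """Map a file byte offset to (device index, offset within that
    device's component) under a simple RAID-0 round-robin layout."""
    stripe_number = offset // stripe_unit         # which stripe unit overall
    device_index = stripe_number % num_devices    # round-robin placement
    local_stripe = stripe_number // num_devices   # nth unit on that device
    return device_index, local_stripe * stripe_unit + offset % stripe_unit
```

  For example, with a 64-byte stripe unit over four devices, offset 64
  falls at the start of the second device's component.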

  The file server is in control of the layout for a file, but the
  client can provide hints to the server when a file is opened or
  created about preferred layout parameters.  The pNFS extension
  introduces a LAYOUT_HINT attribute that the client can query at
  any time, and can set with a compound SETATTR after OPEN to provide
  a hint to the server for new files.

  While not completely specified in this summary, there must be
  adjunct specifications that precisely define layout formats to allow
  interoperability among clients and metadata servers.  The point is
  that the metadata server will give out layouts of a particular class
  (block, object, or file) and aggregation, and the client needs to
  select a "layout driver" that understands how to use that layout.
  The API used by the client to talk to its drivers is outside the
  scope of the pNFS extension, but is an important notion to keep in
  mind when thinking about this work. The storage protocol between
  the client's layout driver and the actual storage is covered by


  other protocols such as SBC (block storage), OSD (object storage)
  or NFS (file storage).
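  The idea of selecting a "layout driver" by layout class can be
  sketched as below.  The registry, constants, and driver interface
  are invented for illustration; the actual client API is explicitly
  out of scope for this extension.

```python
# Hypothetical layout-class constants and a driver registry keyed on
# class; a real client would populate this with the block, object,
# and file drivers it has available.
LAYOUT_CLASS_BLOCK, LAYOUT_CLASS_OBJECT, LAYOUT_CLASS_FILE = 1, 2, 3

_drivers = {}

def register_driver(layout_class, driver):
    _drivers[layout_class] = driver

def select_driver(server_classes):
    """Pick the first server-advertised layout class (from the
    LAYOUT_CLASSES attribute) for which the client has a driver."""
    for cls in server_classes:
        if cls in _drivers:
            return cls, _drivers[cls]
    return None  # no usable class: fall back to plain NFSv4 I/O
```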

4. Security Information

  All existing NFS security mechanisms apply to the operations added by
  this extension.  However, this extension is used in conjunction with
  other storage protocols for client to storage access.  Each storage
  protocol introduces its own security constraints. Clients may need
  security information in order to complete direct data access.  The
  rest of this section gives an overview of the security schemes used
  by different storage protocols.  However, the details are outside the
  scope of this protocol extension and private to the storage protocol.
  We assume only that the file server returns security tokens to the
  client, which the client uses when accessing storage.  The file
  server performs permission checking before issuing the security
  tokens.

4.1 Object Storage Security

  The object storage protocol relies on a cryptographically secure
  capability to control accesses at the object storage devices.
  Capabilities are generated by the metadata server, returned to the
  client, and passed to the object storage device, which verifies
  that the capability allows the requested operation.

  Each capability is specific to a particular object, an operation
  on that object, and a byte range within the object, and has an
  explicit expiration time.  The capabilities are signed with a
  secret key that is shared by the object storage devices (OSDs) and
  the metadata managers.  Typically each OSD has a set of master keys
  and working keys, and the working keys are rotated periodically
  under the control of the metadata manager.  Clients do not have
  device keys, so they are unable to forge capabilities.
  Capabilities need to be protected from snooping, which can be done
  by using facilities such as IPsec to create a secure VPN that
  contains the clients, the file server, and the storage devices.
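  A sketch of such a signed capability, using an HMAC over the
  capability fields with a shared working key.  The field encoding and
  algorithm choice here are assumptions of this example, not the OSD
  specification.

```python
import hmac, hashlib, time

def make_capability(secret_key, object_id, operation, byte_range, lifetime):
    """Metadata-server side: sign a capability that an OSD sharing
    secret_key can verify without contacting the server."""
    expires = int(time.time()) + lifetime
    payload = f"{object_id}:{operation}:{byte_range[0]}-{byte_range[1]}:{expires}"
    tag = hmac.new(secret_key, payload.encode(), hashlib.sha256).hexdigest()
    return payload, tag

def verify_capability(secret_key, payload, tag):
    """OSD side: recompute the HMAC and check expiry; clients cannot
    forge tags because they never hold the device key."""
    expected = hmac.new(secret_key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag):
        return False
    return int(payload.rsplit(":", 1)[1]) >= time.time()
```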

4.2 File Security

  The file storage protocol has the same security mechanism between
  the client and metadata server as between the client and data server.
  This implies that the files that store the data need the same ACL as
  the metadata file that represents the "control point" for the file.
  This ensures that access control decisions are consistent between
  the metadata server and the data server.

  One alternative that was briefly discussed was the introduction
  of special file handles that essentially have the properties of
  capabilities so they can be generated by the metadata servers and
  checked by the data servers.  (Peter Corbett described "one shot"
  file handles.)  To be effective, these need all the properties of a
  capability so the data server can efficiently and securely enforce
  the access control decisions made by the metadata manager.


  [We need to elaborate on this section. We should be able to
  leverage the NFSv4 GSS context between the client and the NFSv4
  "Storage Devices".]

4.3 Block Security

  The block model relies on SAN-based security, and trusts that
  clients will only access the blocks they have been directed to use.
  In these systems, there may not need to be any additional security
  information returned with the map.  There are LUN masking/unmapping
  and zone-based security schemes that can be manipulated to fence
  clients from each other's data.  These are fairly heavyweight
  operations that are not expected to be part of the normal execution
  path for pNFS, but a metadata server can always fall back to these
  mechanisms if it needs to prevent a client from accessing storage
  (i.e., to "fence" the client).

5. pNFS Typed data structures

5.1 pnfs_layoutclass4   

  typedef uint16_t pnfs_layoutclass4;

  A layout class specifies a family of layout types.  The implication
  is that clients have "layout drivers" for one or more layout classes.
  The file server advertises the layout classes it supports through
  the LAYOUT_CLASSES file system attribute.  A client asks for layouts
  of a particular class in LAYOUTGET, and passes those layouts to its
  layout driver.  A layout is further typed by a pnfs_layouttype4
  that identifies a particular layout in the family of layouts of
  that class.  Custom installations should be allowed to introduce
  new layout classes.

  [There is an IANA issue here for the initial set of well known
  layout classes.  There should also be a reserved range for custom
  layout classes used in local installations.]

5.2 pnfs_deviceid4

  typedef uint32_t pnfs_deviceid4;              /* 32-bit device ID */

  Layout information includes device IDs that specify a data server
  with a compact handle.  Addressing and type information is obtained
  with the GETDEVICEINFO operation.

5.3 pnfs_devaddr4

  struct pnfs_devaddr4 {
         uint16_t type;         /* device address type */
         string r_netid<>;      /* network ID */
         string r_addr<>;       /* Universal address */
  };

  This value is used to set up a communication channel with the
  storage device.  For now we borrow the structure of a clientaddr4,
  and assume we will be able to specify SAN devices as well as TCP/IP
  devices using this format.  The type field is used to distinguish
  between known device address types.

  [TODO: we need an enum of known device address types.  These include
  IP+port for file servers and object storage devices.  There may be
  several types for different variants on SAN volume labels.
  Do we need a concrete definition of volume labels for
  SAN block devices?  We have discussed a scheme where the volume
  label is defined as a set of tuples <offset, length, value> that
  allow matching on the initial contents of a SAN volume in order to
  determine equality.  If we do this, is this type a discriminated
  union with a fixed number of branches?  One type would be an IP/port
  combination for an NFS or iSCSI device.  Another type would be this
  volume label specification.]

5.4 pnfs_devlist_item4

  struct pnfs_devlist_item4 {
         pnfs_deviceid4         id;
         pnfs_devaddr4          addr;
  };

  An array of these values is returned by the GETDEVICELIST operation.
  They define the set of devices associated with a file system.

5.5 pnfs_layouttype4

  struct pnfs_layouttype4 {
         pnfs_layoutclass4 class;
         uint16_t type;
  };

  The protocol extension enumerates known layout types and their
  structure.  Additional layout types may be added later.  To allow
  for graceful extension of layout types, the type is broken into
  two fields.

  [TODO: We should chart out the major layout classes and
  representative instances of them, then indicate how new layout
  classes can be introduced.  Alternatively, we can put these
  definitions into the document that specifies the storage protocol.]

5.6 pnfs_layout4

  union pnfs_layout4 switch (pnfs_layouttype4 type) {
         default:
                opaque layout_data<>;
  };

  This opaque type defines a layout.  As noted, we need to flesh out
  this union with a number of "blessed" layouts for different storage
  protocols and aggregation types.


6. pNFS File Attributes

6.1 pnfs_layoutclass4<> LAYOUT_CLASSES

  This attribute applies to a file system and indicates what layout
  classes are supported by the file system.  We expect this attribute
  to be queried when a client encounters a new fsid.  This attribute is
  used by the client to determine if it has applicable layout drivers.

6.2 pnfs_layouttype4 LAYOUT_TYPE

  This attribute indicates the particular layout type used for a file.
  This is for informational purposes only.  The client needs to use
  the LAYOUTGET operation to get enough information (e.g., specific
  device information) to perform I/O.

6.3 pnfs_layouttype4 LAYOUT_HINT

  This attribute is set on newly created files to influence the file
  server's choice for the file's layout.

7. pNFS Error Definitions

        NFS4ERR_LAYOUTUNAVAILABLE       Layouts are not available
        for the file or its containing file system.

        NFS4ERR_LAYOUTTRYLATER          Layouts are temporarily
        unavailable for the file, client should retry later.

8. pNFS Operations

8.1 LAYOUTGET - Get Layout Information


        (cfh), layout_class, iomode, sharemode, offset, length ->
                layout_stateid, layout


        enum layoutget_iomode4 {
                LAYOUTGET_READ          = 1,
                LAYOUTGET_WRITE         = 2,
                LAYOUTGET_RW            = 3
        };

        enum layoutget_sharemode4 {
                LAYOUTGET_SHARED        = 1,
                LAYOUTGET_EXCLUSIVE     = 2
        };

        struct LAYOUTGET4args {
                /* CURRENT_FH: file */
                pnfs_layoutclass4       layout_class;
                layoutget_iomode4       iomode;
                layoutget_sharemode4    sharemode;
                offset4                 offset;
                length4                 length;
        };


        struct LAYOUTGET4resok {
                stateid4                layout_stateid;
                pnfs_layout4            layout;
        };

        union LAYOUTGET4res switch (nfsstat4 status) {
                case NFS4_OK:
                        LAYOUTGET4resok resok4;
                default:
                        void;
        };


  Requests a layout for reading or writing the file given by the
  filehandle at the byte range given by offset and length.  The client
  requests either a shared or exclusive sharing mode for the layout
  to indicate whether it provides its own synchronization mechanism.
  A shared layout allows cooperating clients to perform direct I/O
  using a layout that potentially conflicts with other clients.
  The clients are asserting that they are aware of this issue and
  can coordinate via an external mechanism (e.g., NFSv4 advisory
  locks or an MPI-IO library).  An exclusive layout means that
  the client wants the server to prevent other clients from making
  conflicting changes to the part of the file covered by the layout.
  An exclusive read layout, for example, would not be granted while
  there was an outstanding write layout that overlapped
  the range.  Multiple exclusive read layouts can be given out for the
  same file range.  An exclusive write layout can only be given out
  if there are no other outstanding layouts for the specified range.
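  The grant rules above can be summarized by a server-side check along
  these lines.  This is a simplification for illustration only; a real
  server must also handle recalls, leases, and delegations.

```python
LAYOUTGET_READ, LAYOUTGET_WRITE, LAYOUTGET_RW = 1, 2, 3
LAYOUTGET_SHARED, LAYOUTGET_EXCLUSIVE = 1, 2

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]   # half-open [offset, end) ranges

def may_grant(existing, iomode, sharemode, rng):
    """existing: list of (iomode, sharemode, range) layouts already
    granted.  Shared layouts never conflict, since cooperating
    clients coordinate through an external mechanism."""
    if sharemode == LAYOUTGET_SHARED:
        return True
    for e_iomode, e_share, e_rng in existing:
        if not overlaps(rng, e_rng):
            continue
        if iomode != LAYOUTGET_READ or e_iomode != LAYOUTGET_READ:
            # An exclusive write conflicts with any overlapping
            # layout; an exclusive read conflicts with outstanding
            # write layouts.  Overlapping read layouts may coexist.
            return False
    return True
```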

  Issue - there is some debate about the default value for sharemode
  in client implementations.   One view is that the safest scheme is
  to require applications to request shared layouts explicitly via,
  e.g., an ioctl() operation.  Another view is that shared layouts
  during concurrent access provide the same risks and guarantees that
  NFS does today (i.e., there is only open-to-close sharing semantics)
  and that applications "know" they should use advisory locking to
  serialize access when they anticipate sharing.  By specifying the
  sharemode in the protocol, we support both points of view.

  The LAYOUTGET operation returns layout information for the specified
  byte range. To get a layout from a specific offset through the
  end-of-file (no matter how long the file actually is) use a length
  field with all bits set to 1 (one).  If the length is zero, or if
  a length which is not all bits set to one is specified and the length
  when added to the offset exceeds the maximum 64-bit unsigned integer
  value, the error NFS4ERR_INVAL will result.
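  The offset/length validation described above amounts to the
  following check (a sketch; the actual error values are the nfsstat4
  codes):

```python
ALL_ONES = 0xFFFFFFFFFFFFFFFF   # length meaning "through end of file"

def check_layout_range(offset, length):
    """Return True if a LAYOUTGET range is acceptable, False for the
    cases the text says must yield NFS4ERR_INVAL."""
    if length == ALL_ONES:
        return True                      # to-EOF request, any offset
    if length == 0:
        return False                     # zero length is invalid
    return offset + length <= ALL_ONES   # must not overflow 64 bits
```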

  The format of the returned layout is specific to the underlying
  file system and is specified outside of this document.

  If layouts are not supported for the requested file or its containing
  filesystem, the server should return NFS4ERR_LAYOUTUNAVAILABLE.

  If a layout for the file is unavailable due to transient conditions,
  e.g., file sharing prohibits layouts, the server should return
  NFS4ERR_LAYOUTTRYLATER.
  On success, the current filehandle retains its value.


  Typically, LAYOUTGET will be called as part of a compound RPC
  after an OPEN operation and results in the client having location
  information for the file. The client specifies a layout class that
  limits what kind of layout the server will return.  This prevents
  servers from issuing layouts that are unusable by the client.



8.2 LAYOUTCOMMIT - Commit writes made using a layout


        (cfh), layout_stateid, offset, length, neweof, newlayout ->
                layout_stateid


        union neweof4 switch (bool eofchanged) {
                case TRUE:
                        length4         eof;
                case FALSE:
                        void;
        };

        struct LAYOUTCOMMIT4args {
                /* CURRENT_FH: file */
                stateid4                layout_stateid;
                neweof4                 neweof;
                offset4                 offset;
                length4                 length;
                opaque                  newlayout<>;
        };


        struct LAYOUTCOMMIT4resok {
                stateid4                layout_stateid;
        };

        union LAYOUTCOMMIT4res switch (nfsstat4 status) {
                case NFS4_OK:
                        LAYOUTCOMMIT4resok resok4;
                default:
                        void;
        };


  Commit changes in the layout represented by the current filehandle
  and stateid.

  The LAYOUTCOMMIT operation indicates that the client has completed
  writes using a layout obtained by a previous LAYOUTGET. The client
  may have only written a subset of the data range it previously
  requested. LAYOUTCOMMIT allows it to commit or discard provisionally
  allocated space and to update the server with a new end of file.

  The layout argument to LAYOUTCOMMIT describes what regions have been
  used and what regions can be deallocated. The resulting layout is
  still valid after LAYOUTCOMMIT and can be referenced by the returned
  stateid for future operations.

  The layout information is more verbose for block devices than
  for objects and files because the latter hide the details of block
  allocation behind their storage protocols.  At the minimum, the client
  needs to communicate changes to the end of file location back to
  the server, and its view of the file modify and access times. For
  blocks, it needs to specify precisely which blocks have been used.

  The client may use a SETATTR operation in a compound right after
  LAYOUTCOMMIT in order to set the access and modify times of the file.
  Alternatively, the server could use the time of the LAYOUTCOMMIT
  operation as the file modify time.
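  Client-side assembly of a LAYOUTCOMMIT request might look like the
  sketch below.  Field names follow the XDR above, but the request
  object and helper are invented for illustration.

```python
def build_layoutcommit(layout_stateid, written_ranges, old_eof):
    """Collapse the ranges actually written under the layout into one
    commit range and report a new end-of-file only if writes extended
    the file.  written_ranges is a list of (offset, length) pairs."""
    offset = min(o for o, _ in written_ranges)
    end = max(o + l for o, l in written_ranges)
    eof_changed = end > old_eof
    return {
        "layout_stateid": layout_stateid,
        "offset": offset,
        "length": end - offset,
        "neweof": {"eofchanged": eof_changed,
                   "eof": end if eof_changed else None},
    }
```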

  On success, the current filehandle retains its value.




8.3 LAYOUTRETURN - Release Layout Information


        (cfh), layout_stateid ->


        struct LAYOUTRETURN4args {
                /* CURRENT_FH: file */
                stateid4        layout_stateid;
        };

        struct LAYOUTRETURN4res {
                nfsstat4        status;
        };


  Returns the layout represented by the current filehandle and
  layout_stateid. After this call, the client must not use the layout
  and the associated storage protocol to access the file data.  Before
  it can do that, it must get a new layout delegation with LAYOUTGET.

  Layouts may be returned when recalled or voluntarily (i.e.,
  before the server has recalled them).  In either case the client
  must properly propagate any state changed under the context of the
  layout to storage or to the server before returning the layout.

  On success, the current filehandle retains its value.

  If a client fails to return a layout in a timely manner, then the
  File server should use its management protocol with the storage
  devices to fence the client from accessing the data referenced by
  the layout.

  [TODO: We need to work out how clients return error information if
  they encounter problems with storage.  We could return a single
  OK bit, or we could return more extensive information from the
  layout driver that describes the error condition in more detail.
  It seems like we need an opaque "layout_error" type that is defined
  by the storage protocol along with its layout types.]




8.4 GETDEVICEINFO - Get Device Information


        (cfh), device_id -> device_addr


        struct GETDEVICEINFO4args {
                pnfs_deviceid4                  device_id;
        };

        struct GETDEVICEINFO4resok {
                pnfs_devaddr4                   device_addr;
        };

        union GETDEVICEINFO4res switch (nfsstat4 status) {
                case NFS4_OK:
                        GETDEVICEINFO4resok     resok4;
                default:
                        void;
        };


  Returns device type and device address information for a specified
  device.  The returned device_addr includes a type that indicates
  how to interpret the addressing information for that device.  [TODO:
  or, it is a discriminated union.]  At this time we expect two main
  kinds of device addresses, either IP address and port numbers,
  or SCSI volume identifiers.  The final protocol specification will
  detail the allowed values for device_type and the format of their
  associated location information.

  Note that the address information for a deviceID may change
  dynamically due to various system reconfiguration events.  Clients
  may get errors on their storage protocol that cause them to query
  the metadata server with GETDEVICEINFO and refresh their
  information about a device.
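  As a non-normative sketch, a client might cache GETDEVICEINFO
  results and re-query on storage-protocol errors along these lines
  (all names here are illustrative, not defined by this draft):

```python
# Hypothetical sketch of a client-side device cache; "getdeviceinfo"
# stands in for the GETDEVICEINFO operation against the metadata server.

class DeviceCache:
    """Maps pnfs deviceIDs to the addressing info returned by the server."""

    def __init__(self, getdeviceinfo):
        # getdeviceinfo: callable(device_id) -> device_addr
        self._getdeviceinfo = getdeviceinfo
        self._addrs = {}

    def lookup(self, device_id):
        # Return the cached address, fetching it on first use.
        if device_id not in self._addrs:
            self._addrs[device_id] = self._getdeviceinfo(device_id)
        return self._addrs[device_id]

    def refresh(self, device_id):
        # After a storage-protocol error the cached address may be stale
        # (e.g. a reconfiguration event); re-query the metadata server.
        self._addrs[device_id] = self._getdeviceinfo(device_id)
        return self._addrs[device_id]
```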


8.5  GETDEVICELIST - Get List of Devices


        (cfh) -> device_addr<>


        /* Current file handle */


        struct GETDEVICELIST4resok {
                pnfs_devlist_item4              device_addr_list<>;
        };

        union GETDEVICELIST4res switch (nfsstat4 status) {
                case NFS4_OK:
                        GETDEVICELIST4resok     resok4;
                default:
                        void;
        };


  In some applications, especially SAN environments, it is convenient
  to find out about all the devices associated with a file system.
  This lets a client determine if it has access to these devices,
  e.g., at mount time.  This operation returns a list of items that
  establish the association between the short pnfs_deviceid4 and the
  addressing information for that device.
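  The mount-time use described above can be sketched as follows; the
  item layout and the access probe are hypothetical, not defined by
  this draft:

```python
# Illustrative sketch: build the deviceID -> address map from a
# GETDEVICELIST reply, then check which devices the client can reach.

def build_device_map(device_addr_list):
    # Each item associates a short pnfs_deviceid4 with addressing info.
    return {item["device_id"]: item["device_addr"] for item in device_addr_list}

def accessible_devices(device_map, can_access):
    # can_access: callable(device_addr) -> bool, e.g. a SAN LUN probe.
    return {dev_id for dev_id, addr in device_map.items() if can_access(addr)}
```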


9. Callback Operations

9.1 CB_LAYOUTRECALL - Recall Layout


        stateid, fh ->


        struct CB_LAYOUTRECALLargs {
                stateid4        stateid;
                nfs_fh4         fh;
        };


        struct CB_LAYOUTRECALLres {
                nfsstat4        status;
        };


  The CB_LAYOUTRECALL operation is used to begin the process of
  recalling a layout and returning it to the server.

  If the handle specified is not one for which the client holds a
  layout, an NFS4ERR_BADHANDLE error is returned.

  If the stateid specified is not one corresponding to a valid layout
  for the file specified by the filehandle, an NFS4ERR_BAD_STATEID
  is returned.

  Issue: We have debated adding another kind of callback to push new
  EOF information to the client.  It may not be necessary; the client
  could discover this by polling for attributes.


  The client should reply to the callback immediately.  Replying does
  not complete the recall, except when an error is returned.  The
  recall is not complete until the layout is returned using
  LAYOUTRETURN.

  The client should complete any in-flight I/O operations using
  the recalled layout before returning it via LAYOUTRETURN.  If the
  client has buffered dirty data, it may choose to write it directly
  to storage before calling LAYOUTRETURN, or to write it later using
  normal NFSv4 WRITE operations.
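  The recall handling described above can be sketched as follows; the
  function and method names are hypothetical, not part of the
  protocol:

```python
# Sketch of a client's CB_LAYOUTRECALL handling: reply immediately,
# drain in-flight I/O, flush dirty data, then send LAYOUTRETURN.

def handle_cb_layoutrecall(layout, reply_cb, flush_dirty, layoutreturn):
    # 1. Reply to the callback right away; this does not complete the recall.
    reply_cb("NFS4_OK")
    # 2. Wait for I/O already issued under the layout to finish.
    layout.drain_inflight_io()
    # 3. Dirty data may go straight to storage, or later via NFSv4 WRITE;
    #    this sketch flushes it through storage before the return.
    flush_dirty(layout)
    # 4. The recall completes only when the layout is returned.
    layoutreturn(layout.stateid)
```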




10. Usage Scenarios

  This section has a description of common open, close, read, write
  interactions and how those work with layout delegations. [TODO:
  this section feels rough and I'm not sure it adds value in its
  present form.]

10.1 Basic Read Scenario

  Client does an OPEN to get a file handle.
  Client does a LAYOUTGET for a range of the file, gets back a layout.
  Client uses the storage protocol and the layout to access the file.
  Client returns the layout with LAYOUTRETURN.
  Client releases the open stateID with CLOSE.

  This is rather boring as the client is careful to clean up all server
  state after only a single use of the file.
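  The steps above can be sketched as a call sequence against a
  hypothetical client-side API (none of these names are defined by
  this draft):

```python
# The basic read scenario as code: OPEN, LAYOUTGET, direct storage
# access, LAYOUTRETURN, CLOSE.  The client object is illustrative.

def basic_read(client, path, offset, length):
    fh, open_stateid = client.open(path)              # OPEN
    layout = client.layoutget(fh, offset, length)     # LAYOUTGET
    data = layout.storage_read(offset, length)        # storage protocol I/O
    client.layoutreturn(fh, layout.stateid)           # LAYOUTRETURN
    client.close(fh, open_stateid)                    # CLOSE
    return data
```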

10.2 Multiple Reads to a File

  Client does an OPEN to get a file handle.
  Client does a LAYOUTGET for a range of the file, gets back a layout.
  Client uses the storage protocol and the layout to access the file.
  Client releases the open stateID with CLOSE.

  Client does an OPEN to get a file handle.
  Client finds cached layout associated with file handle.
  Client uses the storage protocol and the layout to access the file.
  Client releases the open stateID with CLOSE.

  A bit more interesting as we've saved the LAYOUTGET operation, but
  we are still doing server round-trips.

10.3 Multiple Reads to a File with Delegations

  Client does an OPEN to get a file handle and an open delegation.
  Client does a LAYOUTGET for a range of the file, gets back a layout.
  Client uses the storage protocol and the layout to access the file.
  Application does a close(), but the client keeps state under the
  open delegation.
  (time passes)
  Application does another open(), which the client handles under the
  open delegation.
  Client finds cached layout associated with file handle.
  Client uses the storage protocol and the layout to access the file.
  (pattern continues until open delegation and/or layout is recalled)

  This illustrates the efficiency of combining open delegations and
  layouts to eliminate interactions with the file server altogether.
  Of course, we assume the client's operating system is only allowing
  the local open() to succeed based on the file permissions.  The use
  of layouts does not change anything about the semantics of open().


10.4 Read with existing writers

  NOTE: This scenario was under some debate, but we have resolved
  that the server is able to give out overlapping/conflicting layout
  information to different clients.  In these cases we assume
  that clients are using an external mechanism such as MPI-IO to
  synchronize and serialize access to shared data.  One can argue that
  even unsynchronized clients get the same open-to-close consistency
  semantics as NFS already provides, even when going direct to storage.

  Client does an OPEN to get an open stateID.  The file is open for
  writing elsewhere by different clients, so no open delegation is
  returned.
  Client does a LAYOUTGET and gets a layout from the server.
  Client either synchronizes with the writers, or not, and accesses data
  via the layout and storage protocol.  There are no guarantees about
  when data that is written by the writer is visible to the reader.
  Once the writer has closed the file and flushed updates to storage,
  then they are visible to the client.
  [TODO: we really aren't explaining the sharemode field here.]

10.5 Read with later conflict

  ClientA does an OPEN to get an open stateID and open delegation.
  ClientA does a LAYOUTGET for a range of the file, gets back a map
  and layout stateid.
  ClientA uses the storage protocol to access the file data.
  ClientB opens the file for WRITE
  File server issues CB_RECALL to ClientA
  ClientA issues DELEGRETURN

  ClientA continues to use the storage protocol to access file data.
  If it is accessing data from its cache, it will periodically
  check that its data is still up-to-date because it has no open
  delegation. [This is an odd scenario that mixes in open delegations
  for no real value.  Basically this is a "regular writer" being mixed
  with a pNFS reader.  I guess this example shows that no particular
  semantics are provided during the simultaneous access.  If the server
  so chose, it could also recall the layout with CB_LAYOUTRECALL to
  force the different clients to serialize at the file server.]

10.6 Basic Write Case

  Client does an OPEN to get a file handle.
  Client does a LAYOUTGET for a range of the file, gets back a layout
  and layout stateid.
  Client writes to the file using the storage protocol.
  Client uses LAYOUTCOMMIT to communicate the new EOF position.
  Client does SETATTR to update timestamps.
  Client does a LAYOUTRETURN
  Client does a CLOSE

  Again, the boring case where the client cleans up all of its server
  state by returning the layout.


10.7 Large Write Case

  Client does an OPEN to get a file handle.
  (begin loop)
  Client does a LAYOUTGET for a range of the file, gets back a layout
  and layout stateid.
  Client writes to the file using the storage protocol.
  Client fills up the range covered by the layout.
  Client updates the server with LAYOUTCOMMIT, communicating the new
  EOF position.
  Client does SETATTR to update timestamps.
  Client returns the layout with LAYOUTRETURN.
  (end loop)
  Client does a CLOSE
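  The large-write loop can be sketched as code against a hypothetical
  client API; the per-iteration range size is an illustrative choice,
  not something this draft specifies:

```python
# Write a large file range by repeatedly getting a layout for the next
# range, writing through it, committing the new EOF, and returning it.

def large_write(client, fh, data, range_size):
    offset = 0
    while offset < len(data):
        chunk = data[offset:offset + range_size]
        layout = client.layoutget(fh, offset, len(chunk))     # LAYOUTGET
        layout.storage_write(offset, chunk)                   # storage I/O
        client.layoutcommit(fh, new_eof=offset + len(chunk))  # LAYOUTCOMMIT
        client.layoutreturn(fh, layout.stateid)               # LAYOUTRETURN
        offset += len(chunk)
    client.setattr_timestamps(fh)                             # SETATTR
    client.close(fh)                                          # CLOSE
```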

10.8 Create with special layout

  Client does an OPEN and a SETATTR that specifies a particular layout
  type using the LAYOUT_HINT attribute.
  Client gets back an open stateID and file handle.

11. Layouts and Aggregation

  This section describes several layout formats in a semi-formal way
  to provide context for the layout delegations. These definitions
  will be formalized in other protocols.  However, the set of
  understood types is part of this protocol in order to provide for
  basic interoperability.

  The layout descriptions include <deviceID, objectID> tuples
  that identify some storage object on some storage device.
  The addressing information associated with the deviceID is obtained
  with GETDEVICEINFO.  The interpretation of the objectID depends on
  the storage protocol.  The objectID could be a filehandle for an
  NFSv4 data server.  It could be an OSD object ID for an object server.
  The layout for a block device generally includes additional block
  map information to enumerate blocks or extents that are part of
  the layout.

11.1 Simple Map

  The data is located on a single storage device.  In this case the
  file server can act as the front end for several storage devices
  and distribute files among them.  Each file is limited in its size
  and performance characteristics by a single storage device. The
  simple map consists of <deviceID, objectID>.

11.2 Block Map

  The data is located on a LUN in the SAN.  The layout consists of
  an array of <deviceID, blockID, blocksize> tuples.  Alternatively,
  the blocksize could be specified once to apply to all entries in
  the layout.


11.3 Striped Map (RAID 0)

  The data is striped across storage devices.  The parameters of the
  stripe include the number of storage devices (N) and the size of
  each stripe unit (U).  A full stripe of data is N * U bytes. The
  stripe map consists of an ordered list of <deviceID, objectID>
  tuples and the parameter value for U.  The first stripe unit (the
  first U bytes) are stored on the first <deviceID, objectID>, the
  second stripe unit on the second <deviceID, objectID> and so forth
  until the first complete stripe.  The data layout then wraps around
  so that byte (N*U) of the file is stored on the first <deviceID,
  objectID> in the list, but starting at offset U within that object.
  The striped layout allows a client to read or write to the component
  objects in parallel to achieve high bandwidth.
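  The stripe arithmetic described above can be written as a small
  resolver; this is a sketch of the layout semantics, not normative
  protocol text:

```python
# Given a file offset, the stripe unit size U, and the ordered device
# list, find which <deviceID, objectID> holds the byte and at what
# offset within that object.

def resolve_striped(offset, unit, devices):
    n = len(devices)
    stripe_unit = offset // unit          # which stripe unit, file-wide
    dev_index = stripe_unit % n           # rotates across the N devices
    # Each full pass over the N devices advances U bytes in every object.
    obj_offset = (stripe_unit // n) * unit + (offset % unit)
    return devices[dev_index], obj_offset
```

  For example, with U = 10 and N = 3, file byte 30 (= N * U) wraps
  back to the first device at object offset 10, matching the text
  above.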

  The striped map for a block device would be slightly different.
  The map is an ordered list of <deviceID, blockID, blocksize>, where
  the deviceID is rotated among a set of devices to achieve striping.

11.4 Replicated Map

  The file data is replicated on N data servers.  The map consists of
  N <deviceID, objectID> tuples.  When data is written using this map,
  it should be written to N objects in parallel.  When data is read,
  any component object can be used.
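  A minimal sketch of the replicated-map semantics, with error
  handling (e.g. partial writes) left out of scope exactly as in the
  text; the replica objects are hypothetical:

```python
# Writes fan out to all N replicas; reads may use any one of them.

def replicated_write(replicas, offset, data):
    # In practice these writes would be issued in parallel.
    for replica in replicas:
        replica.write(offset, data)

def replicated_read(replicas, offset, length, pick=0):
    # Any component object can serve the read; pick one.
    return replicas[pick].read(offset, length)
```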

  This map type is controversial because it highlights the issues with
  error recovery.  Those issues get interesting with any scheme that
  employs redundancy.  The handling of errors (e.g., only a subset
  of replicas get updated) is outside the scope of this protocol
  extension.  Instead, it is a function of the storage protocol and
  the metadata management protocol.

11.5 Concatenated Map

  The map consists of an ordered set of N <deviceID, objectID,
  size> tuples.  Each successive tuple describes the next segment of
  the file.
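  Resolution against a concatenated map can be sketched as a walk over
  the ordered tuples; names are illustrative only:

```python
# Walk the ordered <deviceID, objectID, size> tuples until the segment
# containing the requested file offset is found.

def resolve_concatenated(offset, segments):
    # segments: ordered list of (device_id, object_id, size)
    base = 0
    for device_id, object_id, size in segments:
        if offset < base + size:
            return (device_id, object_id), offset - base
        base += size
    raise ValueError("offset beyond the mapped range")
```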

11.6 Nested Map

  The nested map is used to compose more complex maps out of simpler
  ones.  The map format is an ordered set of M sub-maps, each submap
  applies to a byte range within the file and has its own type such
  as the ones introduced above.  Any level of nesting is allowed in
  order to build up complex aggregation schemes.
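  Nested-map resolution is naturally recursive; the representation
  below (absolute byte ranges, arbitrary leaf types) is a hypothetical
  sketch, not a wire format:

```python
# Resolve a file offset through a nested map: recurse through sub-maps
# until a leaf map of some concrete type is reached.

def resolve_nested(offset, submaps, resolve_leaf):
    # submaps: ordered (start, end, submap) covering absolute byte ranges
    # of the file; submap is either another such list (a nested map) or
    # a leaf map handled by resolve_leaf.
    for start, end, submap in submaps:
        if start <= offset < end:
            if isinstance(submap, list):
                # Recurse one level; nested ranges are also absolute.
                return resolve_nested(offset, submap, resolve_leaf)
            # Leaf maps see the offset relative to their own range.
            return resolve_leaf(offset - start, submap)
    raise ValueError("offset not covered by any sub-map")
```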


12. Issues

12.1 Storage Protocol Negotiation

  Clients may want to negotiate with the metadata server about
  their preferred storage protocol, and to find out what storage
  protocols the server offers.  Clients can do this by querying the
  LAYOUT_CLASSES file system attribute.  They then specify a
  particular layout class in their LAYOUTGET operations.
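  As a sketch, the negotiation amounts to intersecting the server's
  advertised classes with the client's ordered preferences; the class
  names used here are purely illustrative:

```python
# Pick the client's most-preferred layout class that the server offers
# via the LAYOUT_CLASSES attribute; the result would be requested in a
# subsequent LAYOUTGET.

def choose_layout_class(server_classes, client_preference):
    # client_preference: classes the client supports, most preferred first.
    for cls in client_preference:
        if cls in server_classes:
            return cls
    return None  # no match: fall back to plain NFSv4 READ/WRITE
```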

12.2 Crash recovery

  We use the existing client crash recovery and server state recovery
  mechanisms in NFSv4.  In particular, layouts have associated layout
  stateids that "expire" along with the rest of the client state.
  The main new issue introduced by pNFS is that the client may have
  to do a lot of I/O in response to a layout recall.  The client may
  need to send RENEW operations to the server during this period if
  it risks not contacting the server within the lease time.  Of
  course, the client should only send its LAYOUTRETURN after it knows
  its I/O has completed.

12.3 Storage Errors

  As noted under LAYOUTRETURN, there is a need for the client to
  communicate errors it encounters when accessing storage directly.


13. References

  1  Gibson et al., "pNFS Problem Statement", July 2004.

14. Acknowledgments

  Many members of the pNFS informal working group have helped
  considerably.  The authors would like to thank Gary Grider, Peter
  Corbett, Dave Noveck, and Peter Honeyman.  This work is inspired
  by the NASD and OSD work done by Garth Gibson.  Gary Grider of
  Los Alamos National Laboratory (LANL) has been a champion of
  high-performance parallel I/O.

15. Authors' Addresses

  Brent Welch
  Panasas, Inc.
  6520 Kaiser Drive
  Fremont, CA 94555 USA
  Phone: +1 (510) 608 7770

  Benny Halevy
  Panasas, Inc.
  1501 Reedsdale St., #400
  Pittsburgh, PA 15233 USA
  Phone: +1 (412) 323 3500

  David L. Black
  EMC Corporation
  176 South Street
  Hopkinton, MA 01748
  Phone: +1 (508) 293-7953

  Andy Adamson
  CITI University of Michigan
  519 W. William
  Ann Arbor, MI 48103-4943 USA
  Phone: +1 (734) 764-9465

  David Noveck
  Network Appliance
  375 Totten Pond Road
  Waltham, MA 02451 USA
  Phone: +1 (781) 768 5347

16. Full Copyright Notice

  Copyright (C) The Internet Society (2004).  This document is subject
  to the rights, licenses and restrictions contained in BCP 78,
  and except as set forth therein, the authors retain all their rights.

  This document and the information contained herein are provided on an
  "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
  OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
  ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
  INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
  INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
  WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property 
  The IETF takes no position regarding the validity or scope of any    
  Intellectual Property Rights or other rights that might be claimed to 
  pertain to the implementation or use of the technology described in 
  this document or the extent to which any license 
  under such rights might or might not be available; nor does it 
  represent that it has made any independent effort to identify any 
  such rights.  Information on the procedures with respect to rights in 
  RFC documents can be found in BCP 78 and BCP 79. 
  Copies of IPR disclosures made to the IETF Secretariat and any 
  assurances of licenses to be made available, or the result of an 
  attempt made to obtain a general license or permission for the use of 
  such proprietary rights by implementers or users of this 
  specification can be obtained from the IETF on-line IPR repository at
  http://www.ietf.org/ipr.

  The IETF invites any interested party to bring to its attention any
  copyrights, patents or patent applications, or other proprietary
  rights that may cover technology that may be required to implement
  this standard.  Please address the information to the IETF at
  ietf-ipr@ietf.org.

  Funding for the RFC Editor function is currently provided by the 
  Internet Society. 
