
    RDMA over TCP (Was Re: VI (Was: Avoiding deadlock in iSCSI))


    • To: <ips@ece.cmu.edu>
    • Subject: RDMA over TCP (Was Re: VI (Was: Avoiding deadlock in iSCSI))
    • From: "Jim Williams" <jimw@giganet.com>
    • Date: Mon, 25 Sep 2000 10:41:36 -0400
    • Content-Transfer-Encoding: quoted-printable
    • Content-Type: text/plain;charset="iso-8859-1"
    • Sender: owner-ips@ece.cmu.edu

    Stephen Byan wrote:
    
    >Michael Krause [mailto:krause@cup.hp.com] wrote:
    >
    >> [snip]
    >> RDMA != VI though VI does use RDMA technologies.
    >
    >Agreed. It is an example
    >
    >> Prefer to see discussion focused on what RDMA operations are
    >> required, what are the error and ordering requirements, etc.
    > [snip]
    >> [snip]
    >> As such, a general
    >> purpose RDMA solution which operates over TCP/IP is the optimal
    >> solution to pursue since it will lead to the broadest industry
    >> and customer adoption rate.
    >
    >I think we are in complete agreement.
    >
    >Regards,
    >-Steve
    
    
    In response to this I would offer the following proposal with
    the caution that it is very preliminary and has not been
    analyzed or reviewed.  But I thought it might be worth posting
    in order to see what the general response to this approach is.
    
    ##############################################################
    
                             RDMA / TCP
    
    
    1.  Abstract
    
    This document describes a format for encapsulating RDMA (remote direct
    memory access) information within a TCP data stream.  No changes or
    modification to TCP of any sort are required.  This is not intended to
    be a protocol, but rather a common format that may be shared by
    multiple client protocols, for instance VI/TCP and iSCSI.  By using a
    common format it is hoped that design of NICs supporting these multiple
    protocols can be simplified.
    
    Sufficient information is included in the RDMA message format to allow
    the protocol message units to be determined, and to allow an incoming
    RDMA request to be processed even if previous packets are missing and
    awaiting retransmission.  In addition, a CRC-32 is included in each
    segment to enhance the checksum coverage included in TCP.
    
    
    
    2.  Overview
    
    Data transfers consist of a sequence of messages.  Each message is of
    one of four types: Send, RDMA_Write, RDMA_Read_Request, and
    RDMA_Read_Response.  The maximum size of a message is approximately
    2^32 bytes.  Each message is divided into one or more segments.  It is
    RECOMMENDED that each TCP segment contain exactly one RDMA segment.
    The receive end of the connection cannot assume any alignment between
    the RDMA segments and TCP segments; however, a receiver SHOULD optimize
    performance for the case where each TCP segment contains exactly one
    RDMA segment.
    
    
    3.   RDMA Segment Format
    
    The format of an RDMA segment depends on the message type.  Shown below
    are the formats for the four different types of messages.  All
    multibyte formats are to be represented in network byte order (i.e.,
    big-endian).
    
    3.1     Send and RDMA_Read_Response Message Type:
    
    
     |    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
     |7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
     +---------------+---------------+---------------+---------------+
     |    Version    |  res  |B|E|typ|         Segment Length        |
     +---------------+---------------+---------------+---------------+
     |                                                               |
     +                        Connection ID                          +
     |                                                               |
     +---------------+---------------+---------------+---------------+
     |                         Message Number                        |
     +---------------+---------------+---------------+---------------+
     |             order             |     res.      |      CLEN     |
     +---------------+---------------+---------------+---------------+
     |                          Data Offset                          |
     +---------------+---------------+---------------+---------------+
     |                                                               |
     |                       Control Data                            |
     |                                                               |
     +---------------+---------------+---------------+---------------+
     |                                                               |
     |                                                               |
     |                          Payload Data                         |
     |                                                               |
     |                                                               |
     +                               +---------------+---------------+
     |                               |            Padding            |
     +---------------+---------------+---------------+---------------+
     |                            CRC-32                             |
     +---------------+---------------+---------------+---------------+
    
    
    
    
    
    
    
    
    3.2     RDMA_Write Message Type:
    
    
     |    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
     |7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
     +---------------+---------------+---------------+---------------+
     |    Version    |  res  |B|E|typ|         Segment Length        |
     +---------------+---------------+---------------+---------------+
     |                                                               |
     +                        Connection ID                          +
     |                                                               |
     +---------------+---------------+---------------+---------------+
     |                         Message Number                        |
     +---------------+---------------+---------------+---------------+
     |             order             |     res.      |      CLEN     |
     +---------------+---------------+---------------+---------------+
     |                         RDMA Buffer ID                        |
     +---------------+---------------+---------------+---------------+
     |                       RDMA Buffer offset                      |
     +---------------+---------------+---------------+---------------+
     |                          RDMA Length                          |
     +---------------+---------------+---------------+---------------+
     |                                                               |
     |                       Control Data                            |
     |                                                               |
     +---------------+---------------+---------------+---------------+
     |                                                               |
     |                          Payload Data                         |
     |                                                               |
     |                                                               |
     +                               +---------------+---------------+
     |                               |            Padding            |
     +---------------+---------------+---------------+---------------+
     |                            CRC-32                             |
     +---------------+---------------+---------------+---------------+
    
    
    3.3     RDMA_Read_Request Message Type:
    
    
     |    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
     |7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
     +---------------+---------------+---------------+---------------+
     |    Version    |  res  |B|E|typ|         Segment Length        |
     +---------------+---------------+---------------+---------------+
     |                                                               |
     +                        Connection ID                          +
     |                                                               |
     +---------------+---------------+---------------+---------------+
     |                         Message Number                        |
     +---------------+---------------+---------------+---------------+
     |             order             |     res.      |      CLEN     |
     +---------------+---------------+---------------+---------------+
     |                         RDMA Buffer ID                        |
     +---------------+---------------+---------------+---------------+
     |                       RDMA Buffer offset                      |
     +---------------+---------------+---------------+---------------+
     |                           RDMA Length                         |
     +---------------+---------------+---------------+---------------+
     |                                                               |
     |                       Control Data                            |
     |                                                               |
     +---------------+---------------+---------------+---------------+
     |                            CRC-32                             |
     +---------------+---------------+---------------+---------------+
    
    Note that RDMA_Read_Request messages always consist of exactly one
    segment and contain no payload data.
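
    As a rough illustration of the layouts in 3.1 through 3.3, the sketch
    below packs and unpacks the 20 bytes that all three diagrams share
    (Version through CLEN) in network byte order.  The names COMMON_HEADER,
    pack_common and unpack_common are invented for this example, and
    sending the reserved byte as zero is an assumption; the text does not
    say what it should contain.

```python
import struct

# Common header shared by all segment formats, big-endian:
# version (1), byte 1 flags (1), segment length (2), connection ID (8),
# message number (4), order (2), reserved (1), CLEN (1) = 20 bytes.
COMMON_HEADER = struct.Struct(">BBHQIHBB")


def pack_common(version, flags, seg_len, conn_id, msg_num, order, clen):
    """Return the 20 common header bytes (reserved byte assumed zero)."""
    return COMMON_HEADER.pack(version, flags, seg_len, conn_id,
                              msg_num, order, 0, clen)


def unpack_common(buf):
    """Decode the common header from the start of buf into a dict."""
    (version, flags, seg_len, conn_id,
     msg_num, order, _res, clen) = COMMON_HEADER.unpack_from(buf, 0)
    return {"version": version, "flags": flags, "segment_length": seg_len,
            "connection_id": conn_id, "message_number": msg_num,
            "order": order, "clen": clen}
```

    The type-specific fields (Data Offset, or the three RDMA fields) would
    follow these 20 bytes, then CLEN words of control data.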
    
    
    3.4     Segment Field Definitions
    
    
         Version:
    
    The version number refers to the version of the RDMA format, not to
    that of the client protocol.  This document defines version 1, so this
    field should contain 0x1.
    
    
         Res:
    
    These four bits are reserved and not used by the RDMA mechanism.  They
    may be used by the client protocol.
    
    
         B, E:
    
    These are the begin and end bits.  They indicate that this segment is
    the beginning or end, respectively, of the message to which it belongs.
    Either, both, or neither of these bits may be set for a given segment.
    
    
         Type:
    
    Four types of messages are supported, encoded in the two-bit type
    field as 0 - Send, 1 - RDMA_Write, 2 - RDMA_Read_Request, and
    3 - RDMA_Read_Response.
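
    To make the bit positions concrete, here is a minimal sketch of how
    byte 1 of a segment might be built and taken apart, reading the layout
    off the diagrams (res in bits 7-4, B in bit 3, E in bit 2, typ in
    bits 1-0).  The function and constant names are hypothetical.

```python
SEND, RDMA_WRITE, RDMA_READ_REQUEST, RDMA_READ_RESPONSE = 0, 1, 2, 3


def encode_flags(begin, end, msg_type, res=0):
    """Build byte 1: res in bits 7-4, B in bit 3, E in bit 2, typ in bits 1-0."""
    if not 0 <= msg_type <= 3 or not 0 <= res <= 15:
        raise ValueError("field out of range")
    return (res << 4) | (int(begin) << 3) | (int(end) << 2) | msg_type


def decode_flags(byte1):
    """Return (res, begin, end, msg_type) from byte 1 of a segment."""
    return (byte1 >> 4, bool(byte1 & 0x08), bool(byte1 & 0x04), byte1 & 0x03)
```

    A single-segment Send, for instance, has both B and E set and a type
    of 0, so byte 1 is 0x0C when the reserved bits are zero.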
    
    
         Segment Length:
    
    This field contains the length of the RDMA segment in bytes.  This
    length includes the RDMA segment header and payload up to, but not
    including, the padding and CRC.
    
    
         Connection ID:
    
    The Connection ID is a 64 bit value selected at random.  This value is
    selected by the client side of the connection and included in the first
    message segment sent over the connection.  The same value is then used
    for all subsequent segments sent in either direction.  It is
    RECOMMENDED that a secure un-guessable random number generator be used
    to generate these values.  The Connection ID serves two purposes.  It
    allows framing to be recovered after a dropped segment, and it provides
    security against blind attacks.
    
    
         Message Number:
    
    For Send, RDMA_Write, and RDMA_Read_Request messages, the message
    number is assigned sequentially for each message, wrapping from 2^32-1
    to 0.  All three types of messages are part of a single sequence.  For
    RDMA_Read_Response messages, the message number should be set equal to
    the originating RDMA_Read_Request.  The initial message number for the
    first (non RDMA_Read_Response) message sent in each direction SHOULD be
    selected at random.
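
    A small sketch of the numbering rule for the Send / RDMA_Write /
    RDMA_Read_Request sequence follows.  The class name MessageNumbers is
    made up for illustration, and using Python's secrets module for the
    random starting point is only one possible choice.

```python
import secrets

MODULUS = 1 << 32


class MessageNumbers:
    """Allocates message numbers for Send, RDMA_Write and RDMA_Read_Request."""

    def __init__(self, initial=None):
        # Random starting point, as the text suggests (SHOULD).
        self.next = secrets.randbelow(MODULUS) if initial is None else initial

    def allocate(self):
        """Return the next number and advance, wrapping 2^32-1 to 0."""
        n = self.next
        self.next = (n + 1) % MODULUS
        return n
```

    An RDMA_Read_Response would not call allocate(); it reuses the number
    of the request it answers.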
    
    
         Order:
    
    This 16-bit field defines the ordering requirements on a message.  A
    value of "N" in the order field indicates that the payload data may not
    be read from or written to its ultimate source or destination until all
    messages EXCEPT the preceding N have been processed.  The value 0xFFFF
    (all one bits) is reserved to indicate that the operation may be
    performed unconditionally, as soon as the message is received.
    
    As an example of the above, if a sequence of RDMA_Write and
    RDMA_Read_Request messages is received with an order field containing
    zero, then the operations must be done in the order received; however,
    the individual segments of a given RDMA_Write message may be written to
    the target buffer in arbitrary order.  If an RDMA_Write or
    RDMA_Read_Request message is received with an order field containing 5
    and a message number of 97, then message number 91 and all previous
    messages must have been processed before it, while messages 92 through
    96 may still be outstanding.
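
    One way a receiver might apply this rule is sketched below.  It
    assumes the receiver tracks a single value, oldest_unprocessed, the
    number of the oldest message in the sequence that has not yet been
    fully processed; that bookkeeping and the function name are
    assumptions of this example, not part of the format.

```python
UNORDERED = 0xFFFF
MODULUS = 1 << 32


def may_process(msg_num, order, oldest_unprocessed):
    """True if message msg_num may be processed now under its order field.

    oldest_unprocessed is the number of the oldest message not yet fully
    processed (every earlier message is done); it is never later than
    msg_num itself.
    """
    if order == UNORDERED:
        return True
    # Number of messages before msg_num that may still be outstanding.
    outstanding_span = (msg_num - oldest_unprocessed) % MODULUS
    return outstanding_span <= order
```

    With the figures from the text, may_process(97, 5, 92) holds because
    only messages 92 through 96 remain, while may_process(97, 5, 91) does
    not.  The modular subtraction keeps the test valid across the wrap
    from 2^32-1 to 0.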
    
    For a Send message, out of order processing implies that the client
    protocol actually receives these messages out of order.  If, for
    instance, the send messages contain commands, the value in the order
    field of these messages should not allow reordering unless the client
    protocol is allowed to process the contained commands out of order.
    
    
         CLEN, control data:
    
    The 8-bit CLEN field defines the length of the control data included
    with a message.  The value of CLEN is the number of 32 bit words of
    control data included.  The meaning of the control data is determined
    by the client protocol, but the significance is that it is not part of
    the RDMA transfer and should not be written to the RDMA target or read
    response buffer.  For Send messages, the control data designation is
    only for the convenience of the client protocol, and the only
    difference between control and payload data is that control data is not
    counted towards the computation of the data offset.  Typically the
    control data will contain header information for the client protocol in
    addition to that provided by the RDMA segment format.
    
    
         Data Offset:
    
    This field is contained in Send and RDMA_Read_Response messages.  The
    first segment of a message must have a data offset of zero.  In each
    subsequent segment of the message, the offset will be equal to the
    number of payload bytes sent in all previous segments of the message.
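
    The offset rule can be illustrated with a short generator that cuts a
    message payload into segments.  The name split_payload and the
    max_payload parameter (the most payload bytes a sender chooses to put
    in one segment) are hypothetical; control data is left out because it
    does not count towards the offset.

```python
def split_payload(payload, max_payload):
    """Yield (data_offset, begin, end, chunk) for each segment of a message."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    total = len(payload)
    offset = 0
    while True:
        chunk = payload[offset:offset + max_payload]
        last = offset + len(chunk) >= total
        # First segment has offset zero; later ones count bytes already sent.
        yield offset, offset == 0, last, chunk
        if last:
            break
        offset += len(chunk)
```

    For a 10-byte payload and max_payload of 4 this yields data offsets
    0, 4 and 8, with B set on the first segment and E on the last.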
    
    
         RDMA Buffer ID, RDMA Buffer Offset:
    
    These values determine the target address of an RDMA read or write.
    The value of the RDMA Buffer ID must be constant across all segments of
    an RDMA_Write.  The value of the RDMA Buffer Offset can be anything in
    the first segment of an RDMA_Write, but must be incremented in each
    subsequent segment by the number of bytes transferred.
    
    The exact interpretation of these values is determined by the client
    protocol; however, it is expected that the RDMA Buffer ID, possibly
    combined with some bits from the RDMA buffer offset, will be used as an
    index into a table of buffers.  The actual data transfer will occur to
    or from this buffer starting at an offset determined by the RDMA buffer
    offset, or some bits extracted from the RDMA buffer offset.
    
    Unfortunately, because of the differing addressing models used by
    different client protocols, it is not possible to exactly specify how
    buffer ID and offset are resolved to a physical address in the NIC.  It
    is hoped, however, that even with this protocol dependent feature, the
    commonality in the RDMA format should allow more efficient
    implementation of protocol accelerating NICs that support multiple
    protocols requiring RDMA.
    
    
         RDMA Length:
    
    Indicates the number of bytes to be transferred in an RDMA operation.
    In the case of an RDMA_Write, this is the total number of bytes in the
    entire message, and the same value must be repeated in each segment of
    the message.
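
    Putting the three RDMA fields together, the following sketch lists
    the header values a sender might place in each segment of one
    RDMA_Write.  The function name, the dict keys and the max_payload
    parameter are invented for this example.

```python
def rdma_write_segments(buffer_id, start_offset, payload, max_payload):
    """Return one dict of RDMA header fields per segment of an RDMA_Write."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    segments = []
    sent = 0
    while sent < len(payload) or not segments:
        chunk = payload[sent:sent + max_payload]
        segments.append({
            "rdma_buffer_id": buffer_id,                # constant across segments
            "rdma_buffer_offset": start_offset + sent,  # advances by bytes sent
            "rdma_length": len(payload),                # total, repeated each time
            "payload": chunk,
        })
        sent += len(chunk)
    return segments
```

    Because every segment carries its own buffer offset and the total
    length, a receiver could place a segment's payload without having
    seen the earlier segments of the same message.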
    
    
         Padding:
    
    Between 0 and 3 bytes of padding are used to make the segment a
    multiple of 4 bytes in length.  The padding MUST be set to zero by the
    sender and ignored by the receiver.
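
    The padding rule amounts to a one-line calculation, sketched here
    with hypothetical helper names.  The argument is the Segment Length
    value, that is, header plus payload without padding or CRC.

```python
def pad_length(segment_length):
    """Number of zero bytes (0 to 3) needed to reach a multiple of 4."""
    return -segment_length % 4


def add_padding(segment):
    """Append zero padding to the header-plus-payload bytes of a segment."""
    return segment + bytes(pad_length(len(segment)))
```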
    
    
         CRC-32:
    
    The CRC-32 is calculated across the entire segment (but does not cover
    other segments of the same message, or lower level protocol headers
    such as TCP).  The algorithm used to calculate the CRC is exactly that
    used for the ethernet CRC except that a different generator polynomial
    is used.  The generator polynomial for the RDMA CRC is
    
      x^32 + x^31 + x^30 + x^28 + x^27 + x^25 + x^24 + x^22 +
                x^21 + x^20 + x^16 + x^10 + x^9  + x^6  + 1.
    
    This polynomial is the standard ethernet polynomial with a left-right
    reversal.  (Or mathematically, substitute y = x^-1 and multiply by
    y^32).  In hex format with the x^32 term removed, this is 0xDB710641.
    It is desirable to use a different polynomial than ethernet so that
    when an RDMA segment is carried in an ethernet packet, the combined
    protection of two different polynomials is achieved, rather than
    checking twice with the same polynomial.
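
    The relationship between the two polynomials, and a bit-at-a-time CRC
    that accepts either one, can be sketched as follows.  Since the
    detailed algorithm is not yet written down here, the sketch assumes
    the usual Ethernet conventions (all-ones preset, least significant
    bit first, final complement); those conventions and all function
    names are assumptions of this example.

```python
ETHERNET_POLY = 0x04C11DB7   # standard CRC-32 generator, x^32 term dropped
RDMA_POLY = 0xDB710641       # generator given in the text, x^32 term dropped


def reciprocal(poly):
    """Left-right reversal of the full 33-bit polynomial, x^32 term dropped."""
    full = (1 << 32) | poly
    reversed_full = int(format(full, "033b")[::-1], 2)
    return reversed_full & 0xFFFFFFFF


def reflect32(value):
    """Reverse the bit order of a 32-bit value."""
    out = 0
    for _ in range(32):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out


def crc32_bitwise(data, poly):
    """Ethernet-style CRC: all-ones preset, LSB-first bits, final complement."""
    poly_reflected = reflect32(poly)
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly_reflected if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF
```

    Calling crc32_bitwise(data, ETHERNET_POLY) reproduces the ordinary
    CRC-32, which gives a way to check the routine before switching the
    polynomial to RDMA_POLY.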
    
    [ Add reference for ethernet CRC and detailed computation algorithm. ]
    
    
    4.     Segments and Messages
    
    The four types of messages are divided into two groups.  The first
    group consists of Send, RDMA_Write, and RDMA_Read_Request messages, and
    the second group consists of RDMA_Read_Response messages.  Within each
    of these two groups, all messages must be sent in order.  Each message
    is divided into one or more segments, and all the segments of a
    particular message are sent in order.  All segments of one message must
    be sent before the first segment of the next message is sent.  However
    between the two groups, segments may be interleaved arbitrarily.
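
    The rules above can be expressed as a small checker over a sequence
    of segments.  In this sketch each segment is reduced to a
    hypothetical tuple (msg_type, msg_num, begin, end) using the type
    codes and B/E bits from 3.4; only the "finish one message before
    starting the next within a group" rule is checked.

```python
READ_RESPONSE = 3


def interleaving_ok(segments):
    """Check (msg_type, msg_num, begin, end) tuples against the section 4 rules."""
    open_msg = {False: None, True: None}   # keyed by "is read-response group"
    for msg_type, msg_num, begin, end in segments:
        group = msg_type == READ_RESPONSE
        if begin:
            if open_msg[group] is not None:
                return False               # previous message not finished
        elif open_msg[group] != (msg_type, msg_num):
            return False                   # continuation of the wrong message
        open_msg[group] = None if end else (msg_type, msg_num)
    return True
```

    For example, a two-segment RDMA_Write (message 10) may have the
    segments of a Read Response (message 4) interleaved with it, but a
    Send (message 11) may not start until the RDMA_Write has ended.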
    
    [ Show example of a series of segments following these rules. ]
    
    
    5.   Determination of Framing
    
    The first segment on a TCP connection begins, of course, with the
    first data byte sent.  Given the start of a segment,
    the start of the next segment can be determined by noting the length
    field in the header of the segment, rounding up to the next multiple of
    four (to account for padding) and adding four (for the CRC) and moving
    forward that many bytes in the TCP data stream.  In this manner, the
    beginning of each segment can be determined from the last.
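
    The in-order framing walk described above might look like the
    following sketch.  It reads the Segment Length from bytes 2-3 of each
    header; the function names are hypothetical and the stream is assumed
    to hold only complete segments.

```python
import struct


def next_segment_start(stream, start):
    """Offset of the segment after the one that begins at stream[start]."""
    (seg_len,) = struct.unpack_from(">H", stream, start + 2)  # bytes 2-3
    padded = (seg_len + 3) & ~3       # round up to a multiple of four
    return start + padded + 4         # skip the CRC-32


def segment_starts(stream):
    """Walk a fully received byte stream and list every segment start."""
    starts, pos = [], 0
    while pos + 4 <= len(stream):
        starts.append(pos)
        pos = next_segment_start(stream, pos)
    return starts
```

    A segment with a Segment Length of 25 therefore occupies 32 bytes of
    the TCP stream: 25 rounded up to 28, plus 4 for the CRC.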
    
    When a packet is dropped, however, it is desirable to recover framing
    on subsequent segments so that they might be processed by the NIC and
    their payload data placed directly in its ultimate destination.
    
    The recommended method for doing this is to assume that the RDMA
    segment is aligned with a TCP segment, and verify the correctness of
    the header fields of the RDMA segment.  If these header fields are not
    correct, then the NIC should fall back to buffering the packet until it
    can be processed in order.  In particular, the 64 bit connection ID
    field was selected at random, so the only way that payload data could
    match it is by pure chance.  It can be easily shown that even if a
    misaligned packet arrives every 2 us, the MTBF of mistakenly
    identifying this as an aligned packet is greater than one million
    years.  Checking the message number, CRC, and other fields only
    enhances the confidence in this determination.
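
    The arithmetic behind the MTBF figure can be checked with a few
    lines.  The sketch assumes that the bytes falling where the
    Connection ID is expected are uniformly random, so each misaligned
    packet matches with probability 2^-64; the function name and the
    365.25-day year are choices made for this example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def false_match_mtbf_years(id_bits=64, interval_seconds=2e-6):
    """Mean years between chance matches of a random id_bits-wide field."""
    # Each misaligned packet matches with probability 2**-id_bits, so the
    # expected number of packets before a false match is 2**id_bits.
    return (2 ** id_bits) * interval_seconds / SECONDS_PER_YEAR
```

    With the defaults this comes to roughly 1.17 million years, in line
    with the claim above.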
    
    
    
    
    
    
    
    
    
    
    

