
    Ordering Issues for VI over TCP



    This proposal pertains to the layering of VI on top of TCP
    and is being sent to the VIDF and IPS mailing lists.
    For proposed VI/TCP spec see:
    http://www.ietf.org/internet-drafts/draft-dicecco-vitcp-00.txt
    
    TCP provides a reliable, in order, transport service.  So
    when VI is layered on top of TCP, the VI layer should see
    all data from the remote end arrive in order.  However it
    has been proposed that some optimized implementations 
    may want to merge the TCP and VI layers, and do this
    in such a way that incoming packets, which are sometimes
    not in order, can be fully processed as they arrive, and
    the data written to its ultimate destination without
    needing to be buffered pending arrival of any intervening
    packets necessary to do full in order processing.
    
    To fully exploit this out of order processing, it would
    be necessary to modify the VI API definition to allow
    ordering to be optionally relaxed.  To this end I would offer
    the following proposal.  This proposal is NOT offered
    with respect to the 1.1 revision currently under
    discussion, and would only be considered in the
    2.0 time frame.  I am sending this now in hopes of
    getting some very preliminary feedback as to whether
    it makes sense to proceed in this direction.
    
    
     Two additional bits are defined in the control
     field of all transmit descriptors.  These are
     "un-ordered" and "half-ordered".  These bits
     are hint bits in that an implementation is
     free to ignore them and would still be fully
     compliant and interoperable.
    
     The meaning of these bits is as follows.  If
     neither bit is set, then the corresponding
     operation (send, RDMA read, or RDMA write) will
     be fully ordered as is required currently by VI.
     This is true even with respect to other operations
     which may be un-ordered.
    
     If the half-ordered bit is set, then the operation
     will not be completed on the remote host until
     after all preceding operations are complete.
     However if a half-ordered operation is followed
     by an un-ordered operation, an implementation
     is free to reorder these two.  Half-ordered
     operations are useful to send completion
     messages which guarantee that previous operations
     have completed.
    
     If two un-ordered operations (with no fully ordered
     operations between them) are done, then
     an implementation is free to reorder these.
    
     The following chart indicates whether ordering
     is required between two operations.
    
                                      Second op.
     First op.          ordered       half-ordered      un-ordered
     ------------------------------------------------------------
     ordered            yes             yes              yes

     half-ordered       yes             yes              no

     un-ordered         yes             yes              no
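
     As a rough illustration (not part of the proposal text), the
     chart can be read as a small lookup function.  The names
     Ordering and ordering_required are hypothetical, and the RDMA
     read exception described in the text is not modeled here.

```python
from enum import Enum


class Ordering(Enum):
    """Ordering class requested for an operation (hypothetical names)."""
    ORDERED = "ordered"
    HALF_ORDERED = "half-ordered"
    UN_ORDERED = "un-ordered"


def ordering_required(first: Ordering, second: Ordering) -> bool:
    """Return True if `second` must not be reordered ahead of `first`.

    Per the chart, the only pairs that may be reordered are those whose
    second operation is un-ordered and whose first operation is not
    fully ordered.
    """
    may_reorder = (second is Ordering.UN_ORDERED
                   and first is not Ordering.ORDERED)
    return not may_reorder
```

     For example, ordering_required(Ordering.HALF_ORDERED,
     Ordering.UN_ORDERED) is False, while any pair whose second
     operation is ordered or half-ordered yields True.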
    
    
     The only exception to the above is already provided for in
     the VI spec in that an RDMA read (without the fence bit set)
     is not guaranteed to be ordered with respect to a subsequent
     send or RDMA write.  At most one of the fence bit, un-ordered
     bit, and half-ordered bit should ever be set.
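
     A minimal sketch of how a provider might check and interpret
     these control bits follows.  The bit positions and function
     names are assumptions made for illustration only; neither the
     VI spec nor this proposal assigns these values.  Because the new
     bits are hints, the sketch also shows an implementation that
     ignores them and treats every operation as fully ordered.

```python
# Hypothetical bit positions, chosen only for this example.
FENCE = 0x1
UN_ORDERED = 0x2
HALF_ORDERED = 0x4
_EXCLUSIVE = FENCE | UN_ORDERED | HALF_ORDERED


def control_bits_valid(control: int) -> bool:
    """True if at most one of fence, un-ordered, half-ordered is set."""
    selected = control & _EXCLUSIVE
    # A value with zero or one bit set satisfies x & (x - 1) == 0.
    return selected & (selected - 1) == 0


def effective_ordering(control: int, honor_hints: bool = True) -> str:
    """Map a descriptor's control field to the ordering actually applied."""
    if not control_bits_valid(control):
        raise ValueError("more than one of fence/un-ordered/half-ordered set")
    if not honor_hints:
        # Ignoring the hint bits is still compliant: everything is ordered.
        return "ordered"
    if control & UN_ORDERED:
        return "un-ordered"
    if control & HALF_ORDERED:
        return "half-ordered"
    return "ordered"
```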
    
     Note that if an application does two unordered sends, R followed
     by S, and the remote end posts two receive descriptors, X and
     Y, then message R may end up in the buffer designated by Y, and
     S in that by X.  There may be cases where the application 
     may want to put sequence numbers in the application level messages,
     and put them back in order after receiving them out of order.
     The advantage to doing this at the application layer rather 
     than the TCP layer is that zero copy receives can be done
     directly to the application and only pointer reordering is
     needed.
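
     The application-level resequencing described above could look
     roughly like the following sketch.  The Resequencer class and
     the idea of passing the sequence number in separately are
     assumptions; a real application would parse the number from its
     own message header.  Only references to the receive buffers are
     reordered, so the payload is never copied.

```python
class Resequencer:
    """Hands completed receive buffers to the application in sequence order."""

    def __init__(self, first_seq: int = 0):
        self.next_seq = first_seq
        self.pending = {}  # seq -> buffer reference held until its turn

    def on_receive(self, seq: int, buf: memoryview) -> list:
        """Record a completed receive; return buffers now deliverable in order."""
        self.pending[seq] = buf
        ready = []
        # Release every buffer whose predecessors have all been seen.
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

     In the R/S example, if S (sequence 1) completes first in buffer
     X, on_receive(1, X) returns an empty list; when R (sequence 0)
     then completes in buffer Y, on_receive(0, Y) returns [Y, X].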
    
     I would further propose that there be no ordering requirements
     on the order in which RDMA read responses are written to 
     memory on the requesting host.  I am not sure that this
     is currently spelled out one way or the other in the VI spec.
     The only restriction here would be that the result of posting
     multiple RDMA reads pointing to overlapping local receive buffers
     would be unpredictable.  But this is not something that makes
     any sense to do anyway.  (Does anyone disagree with this?)
    
    The above proposed semantics are reflected in the VI/TCP protocol
    as follows.
    
     In addition to the currently defined message types of SEND,
     RDMA_WRITE, and RDMA_READ_REQUEST, three new types are
     defined: SEND_UNORDERED, RDMA_WRITE_UNORDERED, and
     RDMA_READ_REQUEST_UNORDERED.  An implementation may
     (but is not required to) use an unordered message type
     when the following two conditions are met:
    
      1.  The un-ordered bit was set in the corresponding
       transmit descriptor.
    
      2.  There are no ordered messages sent for which a TCP
       ACK has not yet been received.
    
     The significance of the half-ordered bit is that it allows
     subsequent un-ordered messages to be sent with the un-ordered
     message type without having to wait for the associated TCP ACK.
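
     One possible reading of the sender-side rule is sketched below.
     It rests on an interpretation: only messages from fully ordered
     descriptors count as "ordered messages" for condition 2, so
     half-ordered messages go out with the ordinary message type but
     do not hold back later un-ordered ones.  SenderState and its
     methods are hypothetical names, and TCP sequence wraparound is
     ignored for brevity.

```python
class SenderState:
    """Tracks fully ordered messages that have not yet been ACKed by TCP."""

    def __init__(self):
        # TCP sequence number just past the last byte of each fully
        # ordered message that is still unacknowledged.
        self.unacked_ordered_ends = []

    def choose_type(self, base_type: str, unordered_bit: bool,
                    half_ordered_bit: bool, end_seq: int) -> str:
        """Pick the wire message type for one descriptor.

        base_type is "SEND", "RDMA_WRITE", or "RDMA_READ_REQUEST".
        """
        if unordered_bit and not self.unacked_ordered_ends:
            # Conditions 1 and 2 both hold.
            return base_type + "_UNORDERED"
        if not unordered_bit and not half_ordered_bit:
            # Fully ordered: later un-ordered messages must wait for its ACK.
            self.unacked_ordered_ends.append(end_seq)
        return base_type

    def on_tcp_ack(self, ack_seq: int) -> None:
        """Drop messages fully covered by a cumulative TCP ACK."""
        self.unacked_ordered_ends = [
            end for end in self.unacked_ordered_ends if end > ack_seq
        ]
```

     An implementation remains free to return the ordinary type in
     every case, since use of the unordered types is optional.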
    
     On the receiving end, with an RDMA_WRITE_UNORDERED message type,
     the contained data may immediately be written directly to the
     buffer even if previous VI segments are missing.
     With an RDMA_READ_REQUEST_UNORDERED, the RDMA read may likewise
     be done immediately.  On receiving a SEND_UNORDERED,
     the send may be delivered, but the implementation must behave
     consistently in the case of segmented sends (i.e. if a 
     pair of sends are reordered at the receiver, all segments
     of each send must be consistently reordered).  Reordering
     of sends will most likely make sense in the presence of
     short sends which fit in a single packet.
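
     The RDMA write case on the receiving end can be illustrated
     with the sketch below.  It is a simplification under stated
     assumptions: each VI segment carries a small integer index
     standing in for its position in the TCP byte stream, only
     RDMA_WRITE and RDMA_WRITE_UNORDERED are handled, and the
     Receiver class is a hypothetical name.  Un-ordered writes are
     placed on arrival; ordered writes are held until every earlier
     segment has arrived.

```python
class Receiver:
    """Places RDMA write data, holding ordered segments until in order."""

    def __init__(self, memory: bytearray):
        self.memory = memory
        self.next_index = 0        # next in-order segment index expected
        self.held = {}             # index -> (offset, data) awaiting earlier ones
        self.placed_early = set()  # indices already written out of order

    def on_segment(self, index: int, msg_type: str,
                   offset: int, data: bytes) -> None:
        if msg_type == "RDMA_WRITE_UNORDERED":
            # May be written immediately, even if earlier segments are missing.
            self._place(offset, data)
            self.placed_early.add(index)
        else:
            self.held[index] = (offset, data)
        self._drain()

    def _drain(self) -> None:
        """Advance past every segment that is now contiguous."""
        while True:
            if self.next_index in self.placed_early:
                self.placed_early.remove(self.next_index)
            elif self.next_index in self.held:
                offset, data = self.held.pop(self.next_index)
                self._place(offset, data)
            else:
                return
            self.next_index += 1

    def _place(self, offset: int, data: bytes) -> None:
        self.memory[offset:offset + len(data)] = data
```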
    
     Provided the above paragraph on RDMA read responses is
     acceptable, then any received VI segment with message
     type RDMA_READ_RESPONSE may have the data written directly
     to the receive buffer without waiting for previous
     packets to arrive.
    
     This proposal assumes that VI segments and TCP segments
     are aligned in that there is exactly one complete VI
     segment contained in each TCP segment.  The mechanism
     for doing that is not discussed here.
    
    
    
    
    


Last updated: Tue Sep 04 01:07:55 2001