
    RE: An IPS Transport Protocol (was A Transport Protocol Without ACK)



    > From: stewrtrs@stewart.chicago.il.us
    >
    > Any transport protocol proposal is ok. As long as it can be seen and
    > reviewed. So far I have seen only two TCP and SCTP.
    >
    > Oh, a little side note, any transport protocol proposed MUST be able to
    > show TCP like behavior in the face of congestion. And I think, IMHO, that
    > this means  that if it is NOT using RFC2581 procedures it MUST show that
    > it does backoff and share with TCP. It also has a HEAVY burden of proof to
    > show this facility at least in my mind and I would think in the
    > IESG's mind
    > as well...
    
    I will try to describe a transport protocol for iSCSI herein. This proposal
    addresses RFC 2581 congestion management as well as queuing and resource
    management for iSCSI initiator and target devices.  I will call it the IPS
    (IP Storage) Protocol, a hybrid between the FCP of fibre channel and the
    TCP of IP.  The way this email is written, it is not a formal proposal by
    any stretch of the imagination. I am a career adapter designer and I don't
    do RFCs or windows and floors.  Therefore, if in describing this IPS
    Protocol I misuse any words that have specific meanings in the RFCs, my
    sincere apologies to this working group. Herein I assume the iSCSI IETF
    effort can be broken into two parts: one for mapping a SCSI request and
    response to one or more iSCSI PDUs, and another for accommodating a
    transport protocol such as TCP, SCTP, or this proposed protocol, IPS.  This
    proposal addresses the second effort.  If this assumption is wrong, hit the
    delete key now so you won't waste any more time.
    
    1. The Needs
    Light travels through fiber at about 5 us per kilometer, or 8 us per mile.
    With 3000 miles between New York and Los Angeles, the Round Trip Time (RTT)
    is 3000 x 8 x 2, or 48 msec, not counting queuing and delays in the
    switches and routers.  Compared with the latency of just a few microseconds
    on locally attached devices, an iSCSI device must have an appropriate
    transport protocol that deals with this long latency to be a meaningful
    alternative.  Furthermore, congestion on the Internet, which drops and
    duplicates datagrams, demands efficient and reliable error detection and
    retransmission.  Finally, given that TCP/IP is a well-accepted and proven
    transport protocol, iSCSI must support TCP/IP.
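    The round-trip arithmetic above can be checked in a few lines.  The 8
    us/mile figure is the posting's approximation for the speed of light in
    fiber:

```python
# Round-trip propagation delay, using the figures from the text.
US_PER_MILE = 8                 # ~8 microseconds per mile in fiber
MILES_NY_TO_LA = 3000           # New York to Los Angeles

rtt_us = MILES_NY_TO_LA * US_PER_MILE * 2   # out and back
print(rtt_us / 1000, "msec")                # -> 48.0 msec
```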
    
    2. Executive Summary
    For those who do not have time to read this long posting: this IPS proposal
    describes the processing -- both creating and parsing -- of an iSCSI PDU
    encapsulated within an Internet TCP/IP datagram.  Hence the proposal
    complements the current IETF effort that defines the iSCSI PDUs.  An iSCSI
    PDU starts with a media header such as Ethernet or Fibre Channel, followed
    by an IP header, a TCP header, an iSCSI header, and, finally, the data
    payload with CRC.  An iSCSI service provider -- either an iSCSI driver
    running on top of a plain old-fashioned NIC adapter or a sophisticated
    fibre-channel-like iSCSI adapter with a large amount of microcode and local
    memory -- will perform the protocol processing.  This proposal describes
    the processing -- the semantics -- that solves the iSCSI needs above.
    Since the iSCSI PDU has a TCP/IP header, this proposal does not preclude
    using the TCP/IP protocol for iSCSI.  The IPS protocol addresses congestion
    management like that in RFC 2581, which describes the "good citizenship
    behavior" of a protocol -- how to start and how to retransmit data segments
    on a busy network.  This protocol modifies RFC 2581 to deal with the long
    Internet latency of datagram delivery.  The protocol ensures efficient and
    yet reliable delivery.  By stealing some ideas from fibre channel adapters,
    which are now targeted at 50,000 IOs per second, this protocol also
    describes the creation of an exchange table that deals with thousands of
    concurrent iSCSI requests and responses without deadlock or resource
    allocation problems.
    
    3. Terms
    A segment -- a term used in RFC 2581; the same as an iSCSI PDU
    ACK and ACK-0 -- an acknowledge PDU.  For ACK-0, refer to the FC-PH spec.
    An Exchange -- roughly like a session as defined by the working group,
           except that it is executed on a single TCP connection
    An iSCSI Request/Response Message -- an API call to an iSCSI Provider
           describing the sending/receiving of an iSCSI request/response.
    BB-Credit -- refer to the FC-PH spec.
    cwnd and rwnd -- Congestion and Receive Windows, terms used in RFC 2581.
           They have the same value in this protocol.
    SOCKET, CONNECT, BIND System Calls -- same meaning as in the TCP/IP
           implementation
    Delay Constant -- the time units between transfers of sequences
    Data Descriptors -- in the form of a memory handle or a scatter/gather list
           in an iSCSI request/response for sending/receiving segments
    DMA -- Direct Memory Access to transfer iSCSI data payloads to/from iSCSI
           application software using the data descriptors inside an iSCSI
           request/response message
    EE-Credit -- refer to the FC-PH spec.
    Exchange ID -- OX_ID and RX_ID; please refer to the FC-PH spec.
    iSCSI Provider -- an iSCSI driver together with an old-fashioned NIC
           adapter or a modern superfast iSCSI adapter
    iSCSI PDU -- as defined by this working group
    Sequences -- an exchange has many sequences, each of which has many
           segments
    Tag Queuing -- refer to the SCSI SAM spec.
    TCP Connection -- a pair of IP-address and TCP-port values that uniquely
           identifies an application process that transmits/receives an iSCSI
           PDU.
    Retransmission -- a part of error recovery that retransmits a lost sequence
    
    4. Congestion Management
    RFC 2581 is not specific to TCP.  It should be used by every transport
    protocol sharing the network, although its authors based their experiments
    and conclusions on TCP.  The RFC covers four specific topics: slow start,
    congestion avoidance, fast retransmit, and fast recovery.  If other
    protocols on the network do not follow the same rules, then while a TCP
    client/server using slow start waits patiently on a congested network, the
    other protocols will continue to flood the network with new data segments,
    defeating the congestion management.  RFC 2581 is definitely not the best
    thing for a network with extremely long latency.  Let me use an example to
    describe the problem before describing the solution.  Assume the latency
    delay, or round-trip time, between two iSCSI devices in N.Y. and L.A. is 50
    msec.  In addition, assume each data segment is 2K.  Using the slow start
    algorithm of RFC 2581, a sender sends only two segments at the beginning
    and waits for the ACKs before increasing its cwnd.  After waiting 50 msec,
    the sender increases its cwnd to 3, sends 3 segments, and waits again.  On
    a not-so-busy network, to send one MB of data, or 500 segments, the sender,
    being a good citizen on the network, will repeat the wait about 32 times to
    send all 500 2K segments.  The total time for delivering one MB of data is
    50 msec times 32, or about 1.6 seconds.  One may argue that, given enough
    time, the cwnd can be increased to 500 and the whole one MB of data can be
    transferred at once.  However, on any lost packet or out-of-order
    delivery -- which we assume happens often and is the reason for having slow
    start -- the sender, seeing the duplicate ACKs, slows down immediately by
    reducing cwnd quickly.  Furthermore, the RFC also does slow start after
    some idle time, because the network congestion status is no longer known
    after the idle period.  In this super-fast Internet era, when we are
    designing adapters to process each fibre channel request in 20 microseconds
    at 50,000 IOs per second, the 50 msec wait and 1.6 sec for moving one MB of
    data using slow start simply sounds awful.  This problem becomes much worse
    when the MTU is not 2K but is reduced to 512 bytes.  In that case, there
    are 2000 segments for a one MB transfer.  I don't need to challenge your
    imagination when iSCSI is used to back up one TB of data.
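    The example above can be checked with a short sketch.  It uses the
    posting's growth model, in which cwnd grows by one segment per round trip;
    note that RFC 2581 slow start actually grows cwnd faster (roughly doubling
    per RTT), so this is the conservative end of the estimate:

```python
def slow_start_time(total_segments, initial_cwnd=2, rtt_ms=50):
    """Rounds and total time to deliver total_segments, one window per RTT.

    Growth model from the posting: cwnd increases by one segment each
    round trip.  (RFC 2581 slow start is actually faster than this.)
    """
    sent, cwnd, rounds = 0, initial_cwnd, 0
    while sent < total_segments:
        sent += cwnd        # send a full window, then wait one RTT for ACKs
        cwnd += 1
        rounds += 1
    return rounds, rounds * rtt_ms

rounds, ms = slow_start_time(500)   # 1 MB as 500 2K segments
print(rounds, ms)                   # -> 31 1550, i.e. about 1.6 seconds
```

    With a 512-byte MTU (2000 segments per MB), the same model takes roughly
    twice as many rounds, which is the "much worse" case in the text.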
    
    Now the solution.  The IPS protocol breaks the 1 MB of data down into 25
    20K sequences.  Each sequence has ten 2K segments.  Each sequence will be
    acknowledged individually.  We define a Delay Constant between the
    transfers of two consecutive sequences.  On a not-so-busy network, the
    delay should be zero.  Hence, the sender sends all 25 sequences, or 500
    segments, without delay.  Using a 1 Gb adapter, the whole 1 MB of data goes
    out in 10 msec.  25 msec later it arrives at the destination.  Each
    sequence is acknowledged individually.  25 msec later, all 25 ACKs come
    back to the sender.  The whole 1 MB is transferred in 60 msec, not 1.6
    sec.  Compared with a 10 msec transfer on a local network, 60 msec is not
    so great, but it is the best we can do because the 50 msec delay is
    contributed by the speed of light.  A thousand TCP connections will not get
    rid of the 50 msec delay.  If we decide not to keep this IPS Protocol
    simple and stupid, we can make the ACK a little more specific by specifying
    which particular segment is missing.  Only missing segments are then
    retransmitted.  We can even bundle the missing segments from different
    sequences by defining a retransmission sequence that contains only
    retransmitted segments.  As an adapter designer, I prefer keeping it simple
    and stupid by retransmitting the whole sequence.  Instead, we fine-tune it
    by changing the size of a sequence.  When a retransmit is necessary, the
    sender will act as a good citizen by increasing the delay constant between
    sequences.  On successful transmits, the sender will decrease the delay
    constant.  Exactly how aggressively we should back away from a congested
    network -- by a large jump of the delay constant -- will be left for
    simulation.  I do believe the result will depend on the segment sizes and
    the latency values.  Note that the performance of this protocol does not
    depend on the MTU size, because it is designed to stream the segments.
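    A minimal sketch of the delay-constant adjustment described above.  The
    posting leaves the exact backoff policy to simulation, so the
    multiplicative-increase / additive-decrease shape and all the constants
    here are illustrative assumptions only:

```python
class DelayController:
    """Inter-sequence Delay Constant (sketch; all constants are assumptions)."""

    def __init__(self, step_us=10, factor=2, max_us=50_000):
        self.delay_us = 0        # zero on a not-so-busy network: full streaming
        self.step_us = step_us   # additive decrease on successful transmit
        self.factor = factor     # multiplicative increase on retransmit
        self.max_us = max_us     # cap the backoff at roughly one RTT

    def on_sequence_acked(self):
        # Successful transmit: creep back toward streaming at full rate.
        self.delay_us = max(0, self.delay_us - self.step_us)

    def on_retransmit(self):
        # Congestion signal: back away by a large jump of the delay constant.
        self.delay_us = min(self.max_us,
                            max(self.step_us, self.delay_us * self.factor))
```

    On a clear network the delay stays at zero and all 25 sequences stream
    back-to-back; each retransmission widens the gap between sequences, up to
    the cap.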
    
    Notice that this IPS protocol takes an optimistic view of Internet
    traffic, i.e., it assumes the traffic is light.  If that is not true, it
    backs off quickly.  I believe this is necessary for a network with a long
    latency delay, because we can't afford the slow start.  A second thing
    about this IPS Protocol is that one ACK is generated for each sequence
    instead of each segment.  Using the bulk ACK on a busy network with long
    latency reduces the ACK traffic.  The third thing about the IPS is that it
    assumes the receiver is intelligent enough to generate the bulk ACK.  Of
    course, if an ACK is missing, the missing sequence is detected by timeout
    and must be retransmitted.  We should also use the ACK-0 of fibre channel
    to signal the sender that everything is OK even if some ACKs are not
    received by the sender.  ACK-0 will greatly reduce the retransmissions
    caused by a missing ACK.
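    A sketch of the per-sequence bulk ACK on the receive side.  The sequence
    size of ten segments follows the example in section 4; the ACK
    representation is an assumption for illustration:

```python
class SequenceReceiver:
    """Generates one bulk ACK per complete sequence (sketch)."""

    SEGMENTS_PER_SEQUENCE = 10   # from the 20K-sequence / 2K-segment example

    def __init__(self):
        self.arrived = {}        # sequence id -> set of segment numbers seen

    def on_segment(self, seq_id, seg_no):
        segs = self.arrived.setdefault(seq_id, set())
        segs.add(seg_no)
        if len(segs) == self.SEGMENTS_PER_SEQUENCE:
            del self.arrived[seq_id]     # sequence complete
            return ("ACK", seq_id)       # one ACK per sequence, not per segment
        return None                      # wait for the rest of the sequence
```

    A sender that never sees the ACK for a sequence retransmits the whole
    sequence on timeout, as described above.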
    
    5. Queuing Management
    An IPS request/response message is transaction-oriented, i.e., the whole
    "iSCSI session" is described in a single request/response message to the
    IPS provider.  Within a request, the SCSI command, one or more endpoints
    (i.e., IP-address and TCP-port pairs), data descriptors in the form of a
    memory handle or a scatter/gather list, and other needed variables are
    provided.  The IPS request/response message is sent to an iSCSI provider
    that is responsible for creating outgoing PDUs and receiving incoming
    PDUs.  To the provider, each message is an exchange between two endpoints.
    The initiator gives it an OX_ID and the target gives it an RX_ID.  Each
    exchange is executed atomically, i.e., the IPS provider is responsible for
    sequencing the SCSI command, data, and status.  There are no command
    queuing or head-of-queue deadlock problems, because the IPS provider
    creates a giant exchange table.  Whenever a data PDU is received, the IPS
    uses the OX_ID or RX_ID to find the exchange and refers to the exchange
    table to determine what to do.  Data PDUs are served on demand; hence,
    there is no head-of-queue blocking problem.  Outgoing data PDUs are broken
    down into sequences.  After the transfer of each sequence, the IPS provider
    can switch to another exchange to avoid a long delay behind a large
    exchange.  For those who are familiar with a fibre channel adapter,
    executing an IPS request is like executing an FCP request, except for the
    congestion management described earlier.  If more than one endpoint is in
    the iSCSI request/response message, the IPS provider can take the liberty
    of selecting another endpoint to transmit or retransmit.  However, when a
    different endpoint is used, the whole message, or session, is repeated.  A
    Task Management PDU like ABORT may be needed to avoid confusion on the
    receiver side.
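    The exchange-table dispatch described above might look like the following
    sketch.  The field names and the single-key lookup are assumptions; in FCP
    terms the key would be the OX_ID/RX_ID carried in each PDU:

```python
class ExchangeTable:
    """Dispatch incoming data PDUs to their exchange (sketch)."""

    def __init__(self):
        self.exchanges = {}      # exchange id (OX_ID/RX_ID) -> per-exchange state

    def open(self, xid, data_descriptor):
        # One entry per iSCSI request/response message; the entry saves the
        # data descriptor supplied by the application software.
        self.exchanges[xid] = {"descriptor": data_descriptor, "received": 0}

    def on_data_pdu(self, xid, payload):
        entry = self.exchanges.get(xid)
        if entry is None:
            return False         # unsolicited PDU: no exchange, thrown away
        # A real provider would DMA the payload into the application memory
        # named by the saved data descriptor.
        entry["received"] += len(payload)
        return True
```

    Because every data PDU is resolved by table lookup and served on demand,
    no PDU ever waits behind the head of a command queue.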
    
    I do appreciate that some people will implement the iSCSI provider with the
    old-fashioned stream-oriented TCP protocol instead of this IPS protocol.  I
    don't have any problem with the working group trying to solve their
    problems.  Personally, I will never implement an iSCSI provider using the
    TCP stream-oriented protocol.  I would implement the aforementioned
    congestion management in a fibre channel adapter today as an IPS provider.
    As long as an IPS provider deals with the PDUs correctly, it should always
    interoperate with another node that uses the TCP stream-oriented protocol.
    Of course, how the two endpoints generate the ACKs must be uniform.  In
    dealing with an IPS provider using TCP, the concept of a transfer sequence
    disappears: each sequence is a single segment which is ACK'ed
    individually.  By the way, I will never consider multiple TCP paths to
    reduce latency time, because an IPS provider, like a fibre channel adapter,
    is targeted to deliver 50,000 IOs per second, going to 100,000 IOs in the
    near future.  The context-switch time between multiple TCP paths will make
    the 100,000 IOs impossible.  Keeping the segments streaming on the same
    connection path is the only good solution for a long latency delay.
    
    6. Resource Management
    There are three layers of resource management.  First, the BB credit takes
    care of two nodes connected point-to-point or on the same arbitrated loop.
    Using BB credit, one node can never overrun the incoming buffer of another
    node.  This does not apply to iSCSI devices connected to Ethernet, due to
    the collision avoidance protocol; i.e., one has no control over the sender
    of the incoming segments.  Second, the EE credit is equivalent to the rwnd
    variable of RFC 2581.  It manages how many segments a receiver is willing
    to receive.  The EE credit concept is impractical on a network with long
    latency.  Using the earlier example of the one MB transfer, if the EE
    credit is small, the sender must wait after its EE credit is exhausted.
    Only ACKs can replenish the EE credit.  The wait is 50 msec each time.  In
    fact, it is imperative for an IPS provider to use DMA to empty incoming
    segments from its buffer in lieu of EE-credit management.  Using EE credit
    to slow down the sender on a network with long latency makes the
    performance impractical.
    Finally, third, the number of SCSI commands that can be sent to a target
    device is governed by the SCSI tag queuing concept.  The initiator is
    always aware of the number of SCSI commands that can be sent to a target.
    It simply does not make sense to send ten commands to a target that can
    only accept five.  After command #6 is rejected with queue busy, there is
    no guarantee that command #7 will also be rejected, because command #1
    could complete before #7 arrives.  If #7 is not rejected, then #6 and #7
    will be executed out of order, which is not acceptable.  With the exception
    of SCSI tag queuing, an IPS provider cannot use either BB or EE credits.
    It must use DMA to empty the incoming segments quickly.  For those who
    implement the IPS provider over TCP, EE credit can be used.  Then one must
    pay the price of a network with a long latency delay.  Last, but not least,
    in the IPS protocol an IPS provider never needs to allocate cache memory to
    receive PDUs, because it uses the memory supplied by the application
    software via the data descriptor in the request/response message.  Each
    message sets up one exchange table entry, which saves the data descriptor.
    When a PDU is received without an exchange table entry, the segment is
    unsolicited and thrown away.  In other words, the IPS provider is not
    responsible for an incoming segment when there is no application program
    waiting for it.  This is like TCP receiving an incoming segment that has an
    invalid port number.  Like setting up a TCP port, an application program
    must always instruct the IPS provider to create an exchange table entry to
    receive incoming iSCSI segments.
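    The tag-queuing rule above -- never send more commands than the target has
    said it can accept -- can be sketched as an initiator-side gate.  The queue
    depth of five follows the example in the text; the class and method names
    are assumptions:

```python
class CommandGate:
    """Initiator-side SCSI tag-queuing gate (sketch)."""

    def __init__(self, queue_depth=5):   # target's advertised queue depth
        self.queue_depth = queue_depth
        self.outstanding = 0

    def try_send(self):
        # Never exceed the target's depth: if a rejected command #6 were
        # followed by an accepted #7, the two would execute out of order.
        if self.outstanding >= self.queue_depth:
            return False
        self.outstanding += 1
        return True

    def on_complete(self):
        # A completed command frees one slot at the target.
        self.outstanding -= 1
```

    Because the initiator gates itself, the target never has to reject a
    command with queue busy, and the ordering hazard never arises.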
    
    It is OK to send data to a target right after a SCSI command without
    waiting for the Ready-To-Transfer (R2T) from the target.  This is known as
    a streaming transfer.  When a target uses an IPS message to receive a SCSI
    command, it can also have the option to provide data descriptors to receive
    the streamed data without the need to return an R2T first.  The streaming
    transfer is agreed upon when a connection is made.
    
    7. Multiple NICs
    We certainly do not exclude multiple IPS providers.  I believe a wedge
    driver sitting on top of the IPS providers may choose a different one for
    load balancing, as long as they can reach the same destination.  Note that
    since each IPS request/response message is executed atomically by one IPS
    provider, there is no synchronization between them.  On the receiving end,
    the application software can set up multiple IPS providers to receive
    incoming requests.  I don't know enough about this area to make meaningful
    comments.
    
    8. Multiple Paths to the Same Destination
    The IPS protocol uses the SOCKET, CONNECT, and BIND system calls to make a
    TCP connection.  It is assumed that when there are multiple IP addresses
    that reach the same destination, the SOCKET data structure will provide
    that information, which in turn will be given to the IPS provider for
    retransmission consideration.
    
    
    Y.P. Cheng, CTO, ConnectCom Solutions Corp.
    
    

