
    Re:iSCSI/iWARP drafts and flow control

    • To: Caitlin Bestler <>
    • Subject: Re:iSCSI/iWARP drafts and flow control
    • From: Mike Ko <>
    • Date: Sat, 26 Jul 2003 13:57:40 -0700

    In iSER, we expect the flow control to be regulated by the Command 
    Numbering mechanism in iSCSI.  In other words, since the queuing capacity 
    of the receiving iSCSI layer is MaxCmdSN - ExpCmdSN + 1, the receiving 
    iSER layer can use this information to determine the minimum number of 
    untagged buffers.  In addition, it needs to provision a sufficient number 
    of untagged buffers to allow enough time for the iSER layer to respond to 
    incoming immediate commands, asynchronous messages, etc., and replenish 
    the buffers.  The use of a buffer pool shared across multiple connections 
    will allow the iSER layer to replenish the buffers on a statistical basis.
    Mike Ko
    IBM Almaden Research
    San Jose, CA 95120
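    The window arithmetic described above can be sketched in Python as
    follows. This is only an illustration: it assumes CmdSN values are
    32-bit counters that wrap around, and the function names and the
    `headroom` parameter (extra buffers for immediate commands,
    asynchronous messages, etc.) are hypothetical. Treating an
    out-of-range MaxCmdSN as a closed window is a simplification made
    here, not something stated in the message.

```python
CMDSN_MODULUS = 2**32  # assumption: CmdSN fields are 32-bit and wrap around


def command_window(exp_cmd_sn: int, max_cmd_sn: int) -> int:
    """Return the queuing capacity MaxCmdSN - ExpCmdSN + 1, modulo 2**32."""
    window = (max_cmd_sn - exp_cmd_sn + 1) % CMDSN_MODULUS
    # Assumption: a result in the upper half of the range means MaxCmdSN is
    # behind ExpCmdSN - 1, which this sketch treats as a closed window.
    return window if window < CMDSN_MODULUS // 2 else 0


def min_untagged_buffers(exp_cmd_sn: int, max_cmd_sn: int, headroom: int) -> int:
    """Minimum untagged buffers: the command window plus hypothetical headroom."""
    return command_window(exp_cmd_sn, max_cmd_sn) + headroom
```

    For example, with ExpCmdSN = 10 and MaxCmdSN = 17 the window is 8
    commands, so a headroom of 4 would call for at least 12 untagged
    buffers.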
    Subject:        Re:iSCSI/iWARP drafts and flow control
    The proposed mapping of iSCSI onto iWARP offers an
    inadequate solution to the problem of flow control.
    iWARP shifts responsibility for flow control to the ULP. In
    doing so, it allows ULP-specific pacing based upon number
    of requests-in-flight rather than relying on the bottleneck of
    transport buffering to flow control the application. The
    session is no longer throttled by the availability of
    buffers suitable for any message. This topic is covered in
    section 4.5 of the RDMAP/DDP Applicability statement.
    There are two excellent examples of ULP solutions to pacing
    untagged messages: DAFS and the mapping of RPC over iWARP
    for NFS. The latter offers the following section on flow control:
    3.3.  Flow Control
    It is critical to provide flow control for an RDMA
    connection.  RDMA receive operations will fail if a
    pre-posted receive buffer is not available to accept
    an incoming RDMA Send.  Such errors are fatal to the
    connection. This is a departure from conventional
    TCP/IP networking where buffers are allocated
    dynamically on an as-needed basis, and pre-posting is
    not required.
    It is not practical to provide for fixed credit limits
    at the RPC server.  Fixed limits scale poorly, since
    posted buffers are dedicated to the associated
    connection until consumed by receive operations.
    Additionally for protocol correctness, the server must
    be able to reply whether or not a new buffer can be
    posted to accept future receives.
    Flow control is implemented as a simple request/grant
    protocol in the transport header associated with each
    RPC message.  The transport header for RPC CALL
    messages contains a requested credit value for the
    server, which may be dynamically adjusted by the
    caller to match its expected needs.  The transport
    header for the RPC REPLY messages provide the granted
    result, which may have any value except it may not be
    zero when no in-progress operations are present at the
    server, since such a value would result in deadlock.
    The value may be adjusted up or down at each
    opportunity to match the server's needs or policies.
    While RPC CALLs may complete in any order, the current
    flow control limit at the RPC server is known to the
    RPC client from the Send ordering properties.  It is
    always the most recent server granted credits minus
    the number of requests in flight.
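    A rough Python sketch of the client side of the request/grant
    scheme quoted above. The class and field names are hypothetical,
    the initial grant is an assumed starting value, and the zero-grant
    check only approximates the deadlock rule from the client's point
    of view.

```python
class RpcCreditClient:
    """Tracks credits as 'most recent grant minus requests in flight'."""

    def __init__(self, initial_grant: int = 1):
        self.granted = initial_grant  # assumption: some initial grant exists
        self.in_flight = 0

    def available(self) -> int:
        # Current limit: latest server grant minus outstanding requests.
        return max(self.granted - self.in_flight, 0)

    def send_call(self, requested: int) -> dict:
        """Consume one credit and build a (hypothetical) transport header."""
        if self.available() == 0:
            raise RuntimeError("no credits available; wait for a reply")
        self.in_flight += 1
        return {"credits_requested": requested}

    def on_reply(self, granted: int) -> None:
        """Record the grant carried in a REPLY transport header."""
        self.in_flight -= 1
        if granted == 0 and self.in_flight == 0:
            # A zero grant with nothing in progress could never be raised again.
            raise ValueError("zero grant with no requests in flight would deadlock")
        self.granted = granted
```

    Because the grant is overwritten on every reply, the server can
    move the limit up or down at each opportunity without any extra
    message.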
    I believe this is quite a contrast with the iSCSI/iWARP proposal:
    10.1 Flow Control for RDMA Send Message Types
    RDMAP Send Message Types are used by the iSER Layer to
    transfer iSCSI control-type PDUs.  Each RDMAP Send
    Message Type consumes an Untagged Buffer at the Data
    Sink.  However, neither the RDMAP layer nor the iSER
    Layer provides an explicit flow control mechanism for
    the RDMAP Send Message Types.  Therefore, the iSER
    Layer SHOULD provision enough Untagged buffers for
    handling incoming RDMAP Send Message Types to prevent
    a buffer underrun condition at the RDMAP layer. If a
    buffer underrun happens, it may result in the
    termination of the connection.  An implementation may
    choose to satisfy this requirement by using a common
    buffer pool shared across multiple connections, with
    usage limits on a per connection basis and usage
    limits on the buffer pool itself.  In such an
    implementation, exceeding the buffer usage limit for a
    connection or the buffer pool itself may trigger
    interventions from the iSER Layer to replenish the
    buffer pool and/or to isolate the connection causing
    the problem.
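    The shared-pool option in the quoted section might look roughly
    like the Python sketch below. Every name and the string results
    are hypothetical; the draft does not specify how limits are
    tracked or what the interventions are.

```python
class SharedBufferPool:
    """Untagged buffer accounting shared across connections (illustrative)."""

    def __init__(self, pool_limit: int, per_conn_limit: int):
        self.pool_limit = pool_limit          # usage limit for the whole pool
        self.per_conn_limit = per_conn_limit  # usage limit per connection
        self.in_use = {}                      # connection id -> buffers in use

    def total_in_use(self) -> int:
        return sum(self.in_use.values())

    def consume(self, conn_id: str) -> str:
        """Account for one incoming Send and report which limit, if any, was hit."""
        used = self.in_use.get(conn_id, 0) + 1
        self.in_use[conn_id] = used
        if used > self.per_conn_limit:
            return "isolate-connection"   # this connection exceeded its share
        if self.total_in_use() > self.pool_limit:
            return "replenish-pool"       # the pool as a whole is over its limit
        return "ok"

    def release(self, conn_id: str) -> None:
        """Return one buffer to the pool once its message has been handled."""
        if self.in_use.get(conn_id, 0) > 0:
            self.in_use[conn_id] -= 1
```

    Note that the sketch can only react after a limit is exceeded; it
    has no way to tell the sender to slow down.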
    Stating that the iSER Layer "SHOULD" provision enough
    Untagged buffers is an interesting use of the IETF
    "SHOULD". Implementations are *guaranteed* to have a
    valid reason to break the "SHOULD": they do not have
    enough information to comply. The Upper Layer Protocol
    has failed to provide it.
    How is the target supposed to estimate how many
    untagged messages the initiator will presume it is
    capable of handling? Or vice versa? How? Provision
    enough buffers to match your physical line rate under
    the worst case scenarios? Even if you're an economy
    model? Guess? Keep a table by model number? Limit
    yourself to one untagged message in flight? Even if
    you are supposed to be a high performance model?
    Keep trying until you crash the connection?
    True interoperability is not based upon tweaking or
    fine-tuning to match the peers. Peers work together
    because the protocol has enabled any peer to work
    with any other compliant peer. Period. Guesstimating
    has nothing to do with it.
    Fortunately, establishing a credit protocol that is
    compatible with normal iSCSI interactions is easily
    done. Generically an RDMA-capable ULP flow control
    strategy requires three things:
    1) An initial credit level. This can be established
    during connection/stream establishment just as
    is proposed for RDMA Read Credits.
    2) A credit is consumed for each untagged message
    sent, exactly as sending each RDMA Read Request
    consumes an RDMA Read credit.
    3) The ULP reply restores credits. With RDMA Reads
    this is a simple one-to-one process. DAFS also
    has each reply replenish the credit that
    the request it is responding to drained. The
    NFS/RPC protocol allows the RPC layer to
    explicitly vary the number of credits
    restored in each untagged message.
    The only special requirement that I can see is that
    there may be a sequence of untagged messages that are
    not individually acknowledged. That can be taken care
    of by the following rules:
    -- A ULP response to a ULP request implies that all
    prior ULP requests have been processed, even if
    they did not warrant an explicit response.
    -- A ULP response restores credits for itself and
    for any other "phantom" responses that it implies.
    -- If a ULP needs to send a sequence of untagged
    messages that will not be acknowledged and that will
    drain the credits, it needs to insert an untagged
    message that will be acknowledged. Any form of
    echoed NOP or Ping could be used.
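    The three requirements and the rules above can be sketched from the
    sender's side in Python. This is only one possible reading: the
    class and method names are hypothetical, sequence numbers are plain
    integers with no wrap-around, and reserving the last credit for a
    message that will be acknowledged is a conservative policy assumed
    here to keep the sender from stranding itself.

```python
class UntaggedCreditSender:
    """Sender-side credit accounting for untagged messages (illustrative)."""

    def __init__(self, initial_credits: int):
        self.credits = initial_credits  # 1) set at connection establishment
        self.outstanding = []           # sequence numbers not yet covered
        self.next_seq = 0

    def send(self, expects_response: bool) -> int:
        """Consume one credit per untagged message and return its sequence number."""
        if self.credits == 0:
            raise RuntimeError("no credits left")
        if self.credits == 1 and not expects_response:
            # Assumed policy: spend the last credit only on a message that
            # will be acknowledged, for example an echoed NOP.
            raise RuntimeError("send an acknowledged message first")
        seq = self.next_seq
        self.next_seq += 1
        self.credits -= 1               # 2) each untagged message costs a credit
        self.outstanding.append(seq)
        return seq

    def on_response(self, seq: int) -> int:
        """Restore credits for the answered request and all earlier ones."""
        covered = [s for s in self.outstanding if s <= seq]
        self.outstanding = [s for s in self.outstanding if s > seq]
        self.credits += len(covered)    # 3) includes the "phantom" responses
        return len(covered)
```

    With four initial credits, two unacknowledged messages followed by
    one acknowledged message leave a single credit; the response to the
    third message then restores all three at once.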
    Caitlin Bestler

