
    Re: iSCSI/iWARP drafts and flow control

    As Mike points out, the CmdSN-based flow control
    in iSCSI is relevant here.  Let me note that the design
    team behind the current iSER draft considered this topic
    in great detail, but I can now clearly see that the draft
    unfortunately does not capture the design rationale very well.
    iSCSI does not provide PDU-level positive flow control;
    instead it relies on the CmdSN mechanism, from which most
    of the iSCSI "control-type" PDU traffic (to use the DA/iSER
    term) can be precisely estimated.  (Note that only
    control-type PDUs are candidates for Send Messages and are
    thus relevant to this discussion.)  However, it turns out
    that certain rarely used opcode types are not governed by
    the CmdSN-based flow control: immediate commands, SNACK,
    unsolicited NOP-In, Reject, and Async Messages.
    Note that the above does not include unsolicited Data-Out
    PDUs, since the worst-case number of these is precisely
    known from CmdSN; worst-case buffer provisioning for them,
    however, would be both unnecessary and extremely expensive
    in practice.
    The iSER design team thus believed that most storage
    implementations will use buffer pools to deal with this
    reality (as they always have), and that the rare "fringe"
    opcode types mentioned above can easily be absorbed by such
    statistical provisioning.
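    The CmdSN window described above can be sketched as follows.
    This is an illustrative sketch, not iSER code: iSCSI bounds
    non-immediate command traffic by the window MaxCmdSN - ExpCmdSN + 1,
    computed with 32-bit serial-number (wraparound) arithmetic.

```python
# Illustrative sketch of the iSCSI CmdSN command window.
# iSCSI sequence numbers are 32-bit and wrap, so the window is
# computed modulo 2**32 (serial-number arithmetic).

MOD = 1 << 32

def cmd_window(max_cmd_sn: int, exp_cmd_sn: int) -> int:
    """Number of new non-immediate commands the initiator may send:
    MaxCmdSN - ExpCmdSN + 1 (mod 2**32)."""
    return (max_cmd_sn - exp_cmd_sn + 1) % MOD

# e.g. ExpCmdSN = 10, MaxCmdSN = 17 -> a window of 8 commands;
# the window survives 32-bit wraparound as well.
```

    The "fringe" opcodes listed above (immediate commands, SNACK,
    unsolicited NOP-In, Reject, Async Messages) fall outside this
    window, which is why they need statistical provisioning instead.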
    Despite this belief (in fact, even before we were convinced
    of this approach), we did a diligent analysis of a Send
    Message flow control protocol for iSER.  The ultimate
    conclusion was that such a protocol carries far too much
    overhead: it is slow to respond to changing I/O loads,
    reclaiming credits is a burdensome process, announcing new
    credits requires RTT delays, and so on.
    I believe the approach adopted in the current iSER draft is
    appropriate; we do, however, need to polish the flow control
    discussion to include some of the design rationale.
    Mallikarjun Chadalapaka
    Networked Storage Architecture
    Network Storage Solutions
    Hewlett-Packard MS 5668 
    Roseville CA 95747
    ----- Original Message ----- 
    From: "Mike Ko" <>
    To: "Caitlin Bestler" <>
    Cc: <>
    Sent: Saturday, July 26, 2003 1:57 PM
    Subject: Re: iSCSI/iWARP drafts and flow control
    > In iSER, we expect the flow control to be regulated by the Command 
    > Numbering mechanism in iSCSI.  In other words, since the queuing capacity 
    > of the receiving iSCSI layer is MaxCmdSN - ExpCmdSN + 1, the receiving 
    > iSER layer can use this information to determine the minimum number of 
    > untagged buffers.  In addition, it needs to provision a sufficient number 
    > of untagged buffers to allow enough time for the iSER layer to respond to 
    > incoming immediate commands, asynchronous messages, etc., and replenish 
    > the buffers.  The use of a buffer pool shared across multiple connections 
    > will allow the iSER layer to replenish the buffers on a statistical basis.
    > Mike Ko
    > IBM Almaden Research
    > San Jose, CA 95120
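    Mike's provisioning rule could be sketched as below. This is a
    hypothetical illustration, not draft text: the window
    MaxCmdSN - ExpCmdSN + 1 gives the minimum number of untagged
    buffers, and the `headroom` parameter is an assumed allowance
    for the PDUs outside the CmdSN window (immediate commands,
    Async Messages, etc.).

```python
MOD = 1 << 32  # iSCSI sequence numbers are 32-bit and wrap

def min_untagged_buffers(max_cmd_sn: int, exp_cmd_sn: int,
                         headroom: int = 4) -> int:
    """Lower bound on pre-posted untagged receive buffers for one
    connection: the CmdSN window (MaxCmdSN - ExpCmdSN + 1) plus an
    assumed headroom for PDUs not governed by CmdSN (immediate
    commands, Async Messages, ...)."""
    window = (max_cmd_sn - exp_cmd_sn + 1) % MOD
    return window + headroom
```

    With a shared pool, the headroom need not be reserved per
    connection; it only has to cover the statistical worst case
    across all connections sharing the pool.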
    > Subject:        Re: iSCSI/iWARP drafts and flow control
    > The proposed mapping of iSCSI onto iWARP offers an
    > inadequate solution to the problem of flow control.
    > iWARP shifts responsibility for flow control to the ULP. In
    > doing so, it allows ULP-specific pacing based upon the number
    > of requests in flight rather than relying on the bottleneck of
    > transport buffering to flow control the application. The
    > session is no longer throttled by the availability of
    > buffers suitable for any message. This topic is covered in
    > section 4.5 of the RDMAP/DDP Applicability statement.
    > There are two excellent examples of ULP solutions to pacing
    > untagged messages: DAFS and the mapping of RPC over iWARP
    > for NFS. The latter offers the following section on flow
    > control:
    > 3.3.  Flow Control
    > It is critical to provide flow control for an RDMA
    > connection.  RDMA receive operations will fail if a
    > pre-posted receive buffer is not available to accept
    > an incoming RDMA Send.  Such errors are fatal to the
    > connection. This is a departure from conventional
    > TCP/IP networking where buffers are allocated
    > dynamically on an as-needed basis, and pre-posting is
    > not required.
    > It is not practical to provide for fixed credit limits
    > at the RPC server.  Fixed limits scale poorly, since
    > posted buffers are dedicated to the associated
    > connection until consumed by receive operations.
    > Additionally, for protocol correctness, the server must
    > be able to reply whether or not a new buffer can be
    > posted to accept future receives.
    > Flow control is implemented as a simple request/grant
    > protocol in the transport header associated with each
    > RPC message.  The transport header for RPC CALL
    > messages contains a requested credit value for the
    > server, which may be dynamically adjusted by the
    > caller to match its expected needs.  The transport
    > header for the RPC REPLY messages provides the granted
    > result, which may have any value except it may not be
    > zero when no in-progress operations are present at the
    > server, since such a value would result in deadlock.
    > The value may be adjusted up or down at each
    > opportunity to match the server's needs or policies.
    > While RPC CALLs may complete in any order, the current
    > flow control limit at the RPC server is known to the
    > RPC client from the Send ordering properties.  It is
    > always the most recent server granted credits minus
    > the number of requests in flight.
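    The client-side accounting the quoted section describes could be
    sketched as follows. This is a sketch of the scheme, not the
    actual NFS/RDMA implementation; the class and method names are
    illustrative. The current limit is the most recent server grant,
    and availability is that grant minus the requests in flight.

```python
class RpcCreditClient:
    """Illustrative client-side credit accounting for the
    request/grant protocol: the REPLY header carries the new grant,
    and the client may have (granted - in_flight) more CALLs out."""

    def __init__(self, initial_granted: int):
        self.granted = initial_granted  # most recent grant from the server
        self.in_flight = 0              # CALLs sent but not yet replied to

    def can_send(self) -> bool:
        return self.in_flight < self.granted

    def send_call(self) -> None:
        if not self.can_send():
            raise RuntimeError("out of credits; send would overrun server")
        self.in_flight += 1

    def recv_reply(self, granted: int) -> None:
        # The server may adjust the grant up or down, but must not
        # grant zero when the client has nothing else in progress,
        # since that would deadlock the session.
        self.in_flight -= 1
        if granted == 0 and self.in_flight == 0:
            raise RuntimeError("zero grant with nothing in flight: deadlock")
        self.granted = granted
```

    Because Sends are ordered, the client always knows the most
    recent grant; no extra round trip is needed to learn it.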
    > I believe this is quite a contrast with the iSCSI/iWARP proposal:
    > 10.1 Flow Control for RDMA Send Message Types
    > RDMAP Send Message Types are used by the iSER Layer to
    > transfer iSCSI control-type PDUs.  Each RDMAP Send
    > Message Type consumes an Untagged Buffer at the Data
    > Sink.  However, neither the RDMAP layer nor the iSER
    > Layer provides an explicit flow control mechanism for
    > the RDMAP Send Message Types.  Therefore, the iSER
    > Layer SHOULD provision enough Untagged buffers for
    > handling incoming RDMAP Send Message Types to prevent
    > a buffer underrun condition at the RDMAP layer. If a
    > buffer underrun happens, it may result in the
    > termination of the connection.  An implementation may
    > choose to satisfy this requirement by using a common
    > buffer pool shared across multiple connections, with
    > usage limits on a per connection basis and usage
    > limits on the buffer pool itself.  In such an
    > implementation, exceeding the buffer usage limit for a
    > connection or the buffer pool itself may trigger
    > interventions from the iSER Layer to replenish the
    > buffer pool and/or to isolate the connection causing
    > the problem.
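    The shared-pool scheme the quoted draft text describes could be
    sketched as below. This is a hypothetical illustration (class
    name and limits are invented): untagged buffers are drawn from a
    pool shared across connections, with a usage limit per connection
    and on the pool itself; hitting a limit is the point where the
    draft says the iSER Layer may intervene.

```python
class SharedBufferPool:
    """Illustrative shared untagged-buffer pool with per-connection
    and whole-pool usage limits, as described in the iSER draft."""

    def __init__(self, pool_limit: int, per_conn_limit: int):
        self.pool_limit = pool_limit
        self.per_conn_limit = per_conn_limit
        self.in_use = 0
        self.per_conn: dict[str, int] = {}

    def take(self, conn_id: str) -> bool:
        """Consume one buffer for conn_id; False means a limit was hit
        and the iSER Layer should replenish or isolate the connection."""
        used = self.per_conn.get(conn_id, 0)
        if self.in_use >= self.pool_limit or used >= self.per_conn_limit:
            return False
        self.per_conn[conn_id] = used + 1
        self.in_use += 1
        return True

    def give_back(self, conn_id: str) -> None:
        """Return a buffer to the pool after the PDU is processed."""
        self.per_conn[conn_id] -= 1
        self.in_use -= 1
```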
    > Stating that the iSER Layer "SHOULD" provision enough
    > Untagged buffers is an interesting use of the IETF
    > "SHOULD". Implementations are *guaranteed* to have a
    > valid reason to break the "SHOULD": they do not have
    > enough information to comply. The Upper Layer Protocol
    > has failed to provide it.
    > How is the target supposed to estimate how many
    > untagged messages the initiator will presume it is
    > capable of handling? Or vice versa? How? Provision
    > enough buffers to match your physical line rate under
    > the worst case scenarios? Even if you're an economy
    > model? Guess? Keep a table by model number? Limit
    > yourself to one untagged message in flight? Even if
    > you are supposed to be a high performance model?
    > Keep trying until you crash the connection?
    > True interoperability is not based upon tweaking or
    > fine-tuning to match the peers. Peers work together
    > because the protocol has enabled any peer to work
    > with any other compliant peer. Period. Guestimating
    > has nothing to do with it.
    > Fortunately, establishing a credit protocol that is
    > compatible with normal iSCSI interactions is easily
    > done. Generically an RDMA-capable ULP flow control
    > strategy requires three things:
    > 1) An initial credit level. This can be established
    > during connection/stream establishment just as
    > is proposed for RDMA Read Credits.
    > 2) A credit is consumed for each untagged message
    > sent, exactly as sending each RDMA Read Request
    > consumes an RDMA Read credit.
    > 3) The ULP reply restores credits. With RDMA Reads
    > this is a simple one-to-one process. DAFS likewise
    > has each reply replenish the credit drained by
    > the request it is responding to. The
    > NFS/RPC protocol allows the RPC layer to
    > explicitly vary the number of credits
    > restored in each untagged message.
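    The three steps above can be sketched as follows. This is a
    generic illustration (names invented), not a proposal for
    specific wire formats: an initial credit level at setup, one
    credit consumed per untagged send, and replies restoring one or
    more credits.

```python
class CreditedSendQueue:
    """Illustrative sender-side view of the generic 3-step scheme."""

    def __init__(self, initial_credits: int):
        # Step 1: initial credit level, set at connection/stream
        # establishment (as proposed for RDMA Read Credits).
        self.credits = initial_credits

    def send_untagged(self) -> None:
        # Step 2: every untagged message consumes one credit,
        # exactly as each RDMA Read Request consumes a Read credit.
        if self.credits == 0:
            raise RuntimeError("no send credit; would overrun the peer")
        self.credits -= 1

    def on_reply(self, restored: int = 1) -> None:
        # Step 3: the ULP reply restores credits; NFS/RPC lets the
        # reply vary how many are restored.
        self.credits += restored
```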
    > The only special requirement that I can see is that
    > there may be a sequence of untagged messages that are
    > not individually acknowledged. That can be taken care
    > of by the following rules:
    > -- A ULP response to a ULP request implies that all
    > prior ULP requests have been processed, even if
    > they did not warrant an explicit response.
    > -- A ULP response restores credits for itself and
    > for any other "phantom" responses that it implies.
    > -- If a ULP needs to send a sequence of untagged
    > messages that will not be acknowledged and that will
    > drain the credits, it needs to insert an untagged
    > message that will be acknowledged. Any form of
    > echoed NOP or Ping could be used.
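    The "phantom response" rule above could be sketched as below.
    This is an illustration under an assumed numbering scheme (the
    sequence numbers are invented for the example): a response to
    request N implies every earlier outstanding request was
    processed, so it restores a credit for itself plus one per
    implied phantom response.

```python
def credits_restored(responded_seq: int, last_responded_seq: int) -> int:
    """Illustrative rule: a response to request `responded_seq`
    implies all earlier requests (after `last_responded_seq`) were
    processed, restoring one credit for itself plus one for each
    'phantom' response it implies."""
    return responded_seq - last_responded_seq

# A response to request 7, when the last explicit response covered
# request 4, restores 3 credits (for requests 5, 6, and 7).
```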
    > Caitlin Bestler - -

