
    RE: iSCSI: Flow Control



    At 04:56 PM 9/21/00 -0700, Jim McGrath wrote:
    
    >While memory may be getting cheaper, latency and transfer rates are getting
    >higher.  We have gone from 25 m parallel SCSI buses to transcontinental
    >TCP/IP connections; from 1 MB/s to 100 MB/s (and greater) transfer rates.
    >These combine to make the maximum amount of data in flight that keeps the
    >connection full to be growing much faster than memory cost is declining.
    >(Exponential growth rates are applied to both memory cost and transmission
    >speed; distance also appears to be growing very fast, although perhaps not
    >exponentially).
    >
    >So while your argument works if you keep the fabric size the same and
    >increase the transfer rate (as it has been with the ATA interface - buffer
    >costs have declined over the years), it does not work if the fabric keeps on
    >growing as well.
    >
    >If a fabric introduces 1 ms (two orders of magnitude less than the worst
    >cases I have heard) at Gbit speed, then we need 100 Kbytes of buffer space
    >for a connection.  We don't have enough buffer to reserve this for all
    >possible connections we could get (Fibre Channel designs could not reserve 4
    >KByte for a smaller number of potential connections until recently).
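The buffer-sizing arithmetic in the quoted paragraph is a bandwidth-delay
product calculation; a quick sketch (the 1 Gbit/s and 1 ms figures are from
the text above, and the result is the ~100 KByte figure Jim cites, before
rounding):

```python
def bdp_bytes(bandwidth_bps, delay_s):
    """Bandwidth-delay product: bytes that must be in flight
    to keep a link of the given rate full over the given delay."""
    return bandwidth_bps / 8 * delay_s

# 1 Gbit/s link with 1 ms of fabric-introduced delay:
# about 125 KB of buffer per connection (the text rounds to ~100 KB).
print(bdp_bytes(1e9, 1e-3))  # 125000.0
```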
    
    Something to think about w.r.t. this problem:
    
    RDMA semantics:
       Pros:
         - Sender only targets memory that it knows is available and thus 
    does not inject more data than the receiver can accept.  This 
    mitigates the overflow problem.
    
         - End-to-end ULP ACKs provide an implicit credit scheme for the 
    associated target resources.
    
       Cons:
         - One must "slice" up the target resources among a set of senders, 
    which can create scalability problems depending upon the resources required 
    per session.  This is where SEND semantics have their advantage - one can 
    use statistical multiplexing to absorb bursts with minimal buffer overflow 
    reserves and combine this with the idea described below.
    
         - RDMA support requires additional buffer access / tracking logic 
    within the endnode to track the impacted memory.  The semantics are not 
    difficult to implement, but they add cost to the implementation.  
    Note: SEND semantics have DMA chain costs as well, so the actual delta in 
    implementation will vary depending upon the amount of resources one can 
    effectively map / register at a given time.
    
         - For small messages, RDMA does not always provide a cost/benefit 
    advantage, which is why most implementations support both SEND and RDMA 
    semantics.
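The first con above - statically slicing a fixed target buffer pool among
senders so that no sender can inject more than its advertised share - can be
sketched as follows (illustrative only; the class and parameter names are
hypothetical, not from any iSCSI or RDMA specification):

```python
class RdmaTargetPool:
    """Sketch of slicing a fixed target buffer pool among sessions."""

    def __init__(self, total_bytes, sessions):
        # Static slicing: each sender may only target its own slice,
        # which is what creates the scalability problem as the number
        # of sessions grows.
        self.slice = total_bytes // sessions

    def admit(self, session_outstanding, write_len):
        # Sender-side check: target only memory known to be available,
        # so no more data is injected than the receiver can accept.
        return session_outstanding + write_len <= self.slice


# 1 MB of target buffer sliced among 100 sessions -> 10 KB each.
pool = RdmaTargetPool(total_bytes=1_000_000, sessions=100)
print(pool.admit(0, 8_192))      # True: fits within the slice
print(pool.admit(4_096, 8_192))  # False: would exceed the slice
```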
    
    
    >Jim
    >
    >PS if we actually are starting to need windows greater than 64 KBytes, is
    >this a problem?  My understanding is that deployed TCP/IP products do not
    >easily support extremely large windows.  This argues for spreading a single
    >SCSI command across multiple TCP/IP connections for pipelining to overcome
    >latency, not for bandwidth.
    
    Large window support is not difficult to implement and is supported in many 
    endnodes.  However, memory even in large endnodes is limited and subject to 
    oversubscription, so if a link cannot replenish its buffers quickly enough, 
    it drops the incoming packet, and the transport's retransmission / 
    congestion management takes over and adjusts the injection rate.
    
    The question is whether one would like to implement a WRED (weighted random 
    early detection - used today in routing elements) type of system within an 
    endnode (server, storage, etc.) whereby it would drop inbound packets when 
    resources are tight, based on some criteria of the inbound packet (IP addr, 
    QoS, TCP port, etc.).  This would allow the endnode to control which 
    services should have priority when the workload approaches or exceeds the 
    available buffer resources.  It would also allow one to vary the amount 
    of "emergency" reserve buffers discussed by others without having to 
    communicate any of this end-to-end or specify it within the architecture 
    beyond the interface and the interpretation of the drop values.
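A WRED-style drop decision follows the classic RED curve, with a per-class
weight standing in for the packet criteria above (IP addr, QoS, TCP port).
A minimal sketch; the thresholds and parameters are illustrative, not from
any standard:

```python
import random

def wred_drop(avg_queue, min_th, max_th, max_p, weight=1.0):
    """WRED-style drop decision for one inbound packet.

    avg_queue : averaged buffer occupancy
    min_th    : below this, never drop (resources plentiful)
    max_th    : at or above this, always drop (resources exhausted)
    max_p     : maximum drop probability at the top of the ramp
    weight    : per-class bias - a low-priority class gets weight > 1
                and is dropped earlier than a high-priority class
    """
    if avg_queue < min_th:
        return False
    if avg_queue >= max_th:
        return True
    # Linear ramp between the thresholds, biased by the class weight.
    p = max_p * (avg_queue - min_th) / (max_th - min_th) * weight
    return random.random() < min(p, 1.0)

# At the same queue depth, a bulk class (weight 2.0) sees twice the
# drop probability of a latency-sensitive class (weight 0.5).
```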
    
    I believe there is value in creating policy interfaces to communicate 
    whether a given connection has any special policies associated with it; 
    one such policy could be the connection's position in the drop priority 
    list when circumstances warrant it.  The actual policy would live outside 
    of iSCSI (see the e-mail discussions about QoS and policy from this summer 
    for other areas where a policy interface would have benefit), keeping iSCSI 
    opaque to the upper layer / application requirements.
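Such a policy interface could be as simple as a table mapping a connection's
classifier to a drop-priority rank, with the policy content itself maintained
outside iSCSI.  A hypothetical sketch (the classifier fields and ranks are
made up for illustration):

```python
# Hypothetical policy table: (IP addr, TCP port, service class) -> rank.
# Higher rank means dropped earlier when buffers run short; the table's
# contents would be set by an external policy system, not by iSCSI.
drop_policy = {
    ("10.0.0.5", 3260, "bulk-backup"): 3,  # dropped first
    ("10.0.0.7", 3260, "oltp"):        1,  # dropped last
}

def drop_rank(conn, default=2):
    # Connections with no special policy get the middle rank,
    # so the interface stays opaque to unconfigured upper layers.
    return drop_policy.get(conn, default)

print(drop_rank(("10.0.0.5", 3260, "bulk-backup")))  # 3
print(drop_rank(("192.168.1.1", 3260, "unknown")))   # 2
```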
    
    Mike
    
    

