
    Re: iSCSI: No Framing

    In message <>,
    "Peglar, Robert" writes:
    >The original thread began with a question (paraphrased) about '...what
    >applications could consume a 10G pipe for long periods of time'.  I answered
    >that question - disk-disk backup and subsystem replication.
    Even disk-to-disk applications or backup applications really want
    approximately BW*RTT worth of buffering.  Hugh Holbrook's recent
    Stanford PhD thesis traces the conventional wisdom back to an email
    from Van Jacobson to the e2e list in 1990.
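To put rough numbers on the BW*RTT (bandwidth-delay product): here is a small back-of-the-envelope sketch, with illustrative RTT figures chosen by me (not from this message), showing how much buffering a 10G pipe implies at LAN, MAN, and WAN distances.

```python
# Bandwidth-delay product: roughly the buffering needed to keep a
# TCP pipe full.  RTT figures below are illustrative examples.

def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bytes in flight on a fully utilized path."""
    return bandwidth_bps * rtt_seconds / 8

# A 10 Gb/s link at various round-trip times:
for label, rtt in [("LAN, 0.1 ms", 0.0001),
                   ("MAN, 2 ms", 0.002),
                   ("WAN, 50 ms", 0.05)]:
    mb = bdp_bytes(10e9, rtt) / 1e6
    print(f"{label}: {mb:.2f} MB of buffering")
```

Even the 2 ms MAN case implies 2.5 MB in flight, which is why "medium BW*RTT" already strains on-chip NIC RAM.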
    It's reasonably well-known in the TCP community that TCP slow-start
    generates spiky traffic. It leads to bursts of high buffer occupancy
    (e.g., at the point where the exponential ramp-up switches to
    congestion avoidance.)  Indeed, that was the motivation behind
    TCP-Vegas, and the recent work on TCP pacing.
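The spiky behavior is easy to see in a toy model of the congestion window. This is a simplified sketch (no losses, no delayed ACKs; ssthresh is an arbitrary example value): the window doubles every RTT during slow start, so the sender injects exponentially growing bursts right up to the switch-over point.

```python
# Toy model of TCP slow start / congestion avoidance, in
# segments sent per RTT.  Simplified: no loss, no delayed ACKs;
# ssthresh value is an illustrative assumption.

def cwnd_trace(ssthresh=32, rtts=10):
    cwnd, trace = 1, []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd < ssthresh:
            cwnd *= 2          # exponential ramp-up (slow start)
        else:
            cwnd += 1          # linear growth (congestion avoidance)
    return trace

print(cwnd_trace())  # [1, 2, 4, 8, 16, 32, 33, 34, 35, 36]
```

The jump from 16 to 32 segments in a single RTT is exactly the kind of burst that fills a buffer at the transition into congestion avoidance.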
    The whole debate over framing/marking only makes sense if one views
    outboard NIC buffering of RTT*BW as very expensive (e.g., forcing a
    design from onchip RAM to external SRAM). Adding framing of iSCSI PDUs
    allows the NIC to continue doing direct data placement into host
    buffers, accommodating the BW*RTT of TCP buffering in "cheap" host RAM
    rather than "expensive" NIC RAM.  
    But you can't get away from providing the buffers. Not unless you are
    also willing to artificially restrict throughput.  If iSCSI doesn't
    provide some form of framing, then what can a NIC on a MAN with medium
    BW*RTT do, if it sees a drop? It has only a few choices:
      1. Start buffering data outboard, hoping that TCP fast-retransmit will
         send the missing segment(s) before the outboard buffers are exhausted;
      2. Give up on direct data placement, and start delivering packets to
         host memory any old how -- at the cost of software reassembly and
         alignment problems, and a software CRC, once the missing segment is
         recovered;
      3. Start dropping packets, and pay a huge performance cost.
    There are some important caveats around the BW*RTT: if we can
    *guarantee* that the iSCSI NICs are never the bottleneck point, or
    that TCP never tries to reach the true link BW*RTT (due to undersized
    windows), then one can get away with less. (See Hugh Holbrook's thesis
    for more concrete details).
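The undersized-window caveat follows from the standard relation that TCP throughput is capped at window/RTT. A quick illustrative calculation (my numbers, not from this message) shows how a small offered window keeps a flow well below the link rate, and hence below the full BW*RTT of buffering:

```python
# If the offered TCP window is smaller than the path's BW*RTT,
# throughput is window-limited: rate = min(link BW, window / RTT).
# Numbers below are illustrative assumptions.

def max_throughput_bps(window_bytes, rtt_s, link_bps):
    return min(link_bps, window_bytes * 8 / rtt_s)

# A 64 KB window on a 10 Gb/s path with a 2 ms MAN RTT:
rate = max_throughput_bps(65536, 0.002, 10e9)
print(f"{rate / 1e6:.0f} Mb/s")   # far below the 10 Gb/s link rate
```

In that regime the NIC only ever has the (small) window's worth of data in flight, so it can get away with much less than the full link BW*RTT of buffering.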
    But the lesson to take away is that even in relatively well-behaved
    LANs, TCP *by design* is always oscillating around overloading the
    available buffers, causing a drop, then backing off.  See, for
    example, Figure 2 of the paper by Janey Hoe which introduced "New
    Reno"; or Figs. 2 and 3 of the paper by Floyd and Fall.  New Reno
    avoids the long timeouts between drops, but the drops themselves still
    occur.
    Moral: TCP can require significant buffering even on quite modest
    networks.  It __may__ be worth keeping framing, so that host NICs can
    do more of that buffering in host memory rather than outboard; and so
    they can continue performing DDP rather than software reassembly and
    software CRC checking. Storage devices are another issue again.
    V. Jacobson, Modified TCP congestion avoidance algorithm.
    Email to the e2e list, April 1990.
    L. Brakmo, S. O'Malley, and L. Peterson, TCP Vegas: New techniques for
    congestion detection and avoidance, ACM SIGCOMM 1994.
    J. Kulik, R. Coulter, D. Rockwell, and C. Partridge, A simulation
    study of paced TCP.  BBN Technical Memorandum 1218, BBN, August 1999.
    J. Hoe, Improving the start-up behavior of a congestion control scheme
    for TCP, ACM SIGCOMM 1996.
    S. Floyd and K. Fall, Simulation-based comparisons of Tahoe, Reno, and
    SACK TCP, Computer Communication Review, vol. 26, no. 3, July 1996.
    H. Holbrook, A Channel Model for Multicast.  PhD dissertation,
    Department of Computer Science, Stanford University, August 2001.
    (See Chapter 5.)
    (Holbrook cites Aggarwal, Savage, and Anderson, INFOCOM 2000, on the
    downsides of TCP pacing; but I haven't read that.  The PILC draft on
    link designs touches on the same issue, but the throughput equations
    cited there factor out buffer size.)
    >FC is not sufficient.  Storage-to-storage needs all the advantages as well
    >as that which iSCSI has to offer the host-storage model.
    But it will still need approximately BW*RTT of buffering, even for
    low-delay LANs.  Otherwise, performance will fall off a cliff under
    "congestion" -- e.g., each time some other iSCSI flow starts up, it
    begins competing for the same TCP endpoint buffers on the same iSCSI
    device, triggering a burst of TCP loss events for the
    storage-to-storage flow.

