
    Re: iSCSI: No Framing

    In message <>,
    "Peglar, Robert" writes:
    >The original thread began with a question (paraphrased) about '...what
    >applications could consume a 10G pipe for long periods of time'.  I answered
    >that question - disk-disk backup and subsystem replication.
    Even disk-to-disk applications or backup applications really want
    approximately BW*RTT worth of buffering.  Hugh Holbrook's recent
    Stanford PhD thesis traces the conventional wisdom back to an email
    from Van Jacobson to the e2e list in 1990.
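To put rough numbers on the BW*RTT (bandwidth-delay product): here is a small back-of-the-envelope sketch, with illustrative RTT figures chosen by me (not from this message), showing how much buffering a 10G pipe implies at LAN, MAN, and WAN distances.

```python
# Bandwidth-delay product: roughly the buffering needed to keep a
# TCP pipe full.  RTT figures below are illustrative examples.

def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bytes in flight on a fully utilized path."""
    return bandwidth_bps * rtt_seconds / 8

# A 10 Gb/s link at various round-trip times:
for label, rtt in [("LAN, 0.1 ms", 0.0001),
                   ("MAN, 2 ms", 0.002),
                   ("WAN, 50 ms", 0.05)]:
    mb = bdp_bytes(10e9, rtt) / 1e6
    print(f"{label}: {mb:.2f} MB of buffering")
```

Even the 2 ms MAN case implies 2.5 MB in flight, which is why "medium BW*RTT" already strains on-chip NIC RAM.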
    It's reasonably well-known in the TCP community that TCP slow-start
    generates spiky traffic. It leads to bursts of high buffer occupancy
    (e.g., at the point where the exponential ramp-up switches to
    congestion avoidance.)  Indeed, that was the motivation behind
    TCP-Vegas, and the recent work on TCP pacing.
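The spiky behavior is easy to see in a toy model of the congestion window. This is a simplified sketch (no losses, no delayed ACKs; ssthresh is an arbitrary example value): the window doubles every RTT during slow start, so the sender injects exponentially growing bursts right up to the switch-over point.

```python
# Toy model of TCP slow start / congestion avoidance, in
# segments sent per RTT.  Simplified: no loss, no delayed ACKs;
# ssthresh value is an illustrative assumption.

def cwnd_trace(ssthresh=32, rtts=10):
    cwnd, trace = 1, []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd < ssthresh:
            cwnd *= 2          # exponential ramp-up (slow start)
        else:
            cwnd += 1          # linear growth (congestion avoidance)
    return trace

print(cwnd_trace())  # [1, 2, 4, 8, 16, 32, 33, 34, 35, 36]
```

The jump from 16 to 32 segments in a single RTT is exactly the kind of burst that fills a buffer at the transition into congestion avoidance.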
    The whole debate over framing/marking only makes sense if one views
    outboard NIC buffering of RTT*BW as very expensive (e.g., forcing a
    design from onchip RAM to external SRAM). Adding framing of iSCSI PDUs
    allows the NIC to continue doing direct data placement into host
    buffers, accommodating the BW*RTT of TCP buffering in "cheap" host RAM
    rather than "expensive" NIC RAM.  
    But you can't get away from providing the buffers. Not unless you are
    also willing to artificially restrict throughput.  If iSCSI doesn't
    provide some form of framing, then what can a NIC on a MAN with medium
    BW*RTT do, if it sees a drop? It has only a few choices:
      1. Start buffering data outboard, hoping that TCP fast-retransmit will
         send the missing segment(s) before the outboard buffers are exhausted;
      2. Give up on direct data placement, and start delivering packets to
         host memory any old how -- at the cost of software reassembly and
         alignment problems, and a software CRC, once the missing segment is
         recovered;
      3. Start dropping packets, and pay a huge performance cost.
    There are some important caveats around the BW*RTT: if we can
    *guarantee* that the iSCSI NICs are never the bottleneck point, or
    that TCP never tries to reach the true link BW*RTT (due to undersized
    windows), then one can get away with less. (See Hugh Holbrook's thesis
    for more concrete details).
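The undersized-window caveat follows from the standard relation that TCP throughput is capped at window/RTT. A quick illustrative calculation (my numbers, not from this message) shows how a small offered window keeps a flow well below the link rate, and hence below the full BW*RTT of buffering:

```python
# If the offered TCP window is smaller than the path's BW*RTT,
# throughput is window-limited: rate = min(link BW, window / RTT).
# Numbers below are illustrative assumptions.

def max_throughput_bps(window_bytes, rtt_s, link_bps):
    return min(link_bps, window_bytes * 8 / rtt_s)

# A 64 KB window on a 10 Gb/s path with a 2 ms MAN RTT:
rate = max_throughput_bps(65536, 0.002, 10e9)
print(f"{rate / 1e6:.0f} Mb/s")   # far below the 10 Gb/s link rate
```

In that regime the NIC only ever has the (small) window's worth of data in flight, so it can get away with much less than the full link BW*RTT of buffering.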
    But the lesson to take away is that even in relatively well-behaved
    LANs, TCP *by design* is always oscillating around overloading the
    available buffers, causing a drop, then backing off.  See, for
    example, Figure 2 of the paper by Janey Hoe which introduced "New
    Reno"; or Figs. 2 and 3 of the paper by Floyd and Fall.  New Reno
    avoids the long timeouts between drops, but the drops themselves still
    occur.
    Moral: TCP can require significant buffering even on quite modest
    networks.  It __may__ be worth keeping framing, so that host NICs can
    do more of that buffering in host memory rather than outboard; and so
    they can continue performing DDP rather than software reassembly and
    software CRC checking. Storage devices are another issue again.
    V. Jacobson, Modified TCP congestion avoidance algorithm.
    Email to the e2e list, April 1990.
    L. Brakmo, S. O'Malley, and L. Peterson, TCP Vegas: New techniques for
    congestion detection and avoidance, ACM SIGCOMM 1994.
    J. Kulik, R. Coulter, D. Rockwell, and C. Partridge, A simulation
    study of paced TCP.  BBN Technical Memorandum 1218, BBN, August 1999.
    J. Hoe, Improving the start-up behavior of a congestion control scheme
    for TCP, ACM SIGCOMM 1996.
    S. Floyd and K. Fall, Simulation-based comparisons of Tahoe, Reno, and
    SACK TCP, Computer Communication Review, vol. 26, no. 3, July 1996.
    H. Holbrook, A Channel Model for Multicast.  PhD dissertation,
    Department of Computer Science, Stanford University, August 2001.
    (See Chapter 5.)
    (Holbrook cites Aggarwal, Savage, and Anderson, INFOCOM 2000, on the
    downsides of TCP pacing; but I haven't read that.  The PILC draft on
    link designs touches on the same issue, but the throughput equations
    cited there factor out buffer size.)
    >FC is not sufficient.  Storage-to-storage needs all the advantages as well
    >as that which iSCSI has to offer the host-storage model.
    But it will still need approximately BW*RTT of buffering, even for
    low-delay LANs.  Otherwise, performance will fall off a cliff under
    "congestion" -- e.g., each time some other iSCSI flow starts up, it
    begins competing for the same TCP endpoint buffers on the same iSCSI
    device, triggering a burst of TCP loss events for the
    storage-to-storage flow.

