
    Re: [Tsvwg] [SCTP checksum problems]



    
    
    Jonathan,
    
    Thanks for your comments.  We are aware that we really don't know what the
    error model for the end-to-end transport is, so we took a conservative
    approach: we do an end-to-end data check above what TCP offers, and one
    that is not aligned with the TCP packets.
    
    We assume that storage boxes will be better built than other middle boxes,
    and that hardware accelerators in the endpoints will not cause trouble.
    
    For a completely garbled link, we think that a combination of a good CRC
    and format checks will keep us from passing around corrupted data, and that
    solid recovery mechanisms will keep us from failing the expected QoS.
    
    The one question to which we could not get any decent answer is how
    mechanisms other than CRC would perform - mainly, how a connection protected
    by cryptographic authenticators will behave on connections with errors.
    
    Regards,
    Julo
    
    Jonathan Stone <jonathan@dsg.stanford.edu> on 18/04/2001 17:50:24
    
    Please respond to Jonathan Stone <jonathan@dsg.stanford.edu>
    
    To:   Julian Satran/Haifa/IBM@IBMIL
    cc:   Randall Stewart <rrs@cisco.com>, "WENDT,JIM (HP-Roseville,    ex1)"
          <jim_wendt@hp.com>, ips@ece.cmu.edu, tsvwg@ietf.org, "'Craig
          Partridge'" <craig@aland.bbn.com>, Jonathan Wood
          <Jonathan.Wood@sun.com>, xieqb@cig.mot.com, Jonathan Stone
          <jonathan@dsg.stanford.edu>
    Subject:  Re: [Tsvwg] [SCTP checksum problems]
    
    
    
    
    
    Julian,
    
    I skimmed your i-d late last night.
    
    I have not gone through the analysis of different CRCs. I'd like to
    compare it to Raj Jain's analysis of the IEEE 802 CRC-32 in
    http://www.cis.ohio-state.edu/~jain/papers/xie1.html, which I think
    speaks to the single-bit-error point Craig has already raised.
    
    The question I'd raise is a more fundamental one: whether link-level
    bit and burst error rates are the appropriate model for an Internet
    transport-level sum in the first place.
    
    Craig Partridge and I examined that in our SIGCOMM 2000 paper.
    We monitored packets at a number of points in the Internet, and looked
    for packets with checksum mismatches -- packets where recomputing the
    checksum did not match the content of the checksum field.  We also
    looked for transport-level (TCP) retransmissions of the damaged
    packets.
    
    Let's call packets where a recomputation of the TCP (or UDP) checksum
    does not match the contents of the checksum field a "mismatch".
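    
    For concreteness, the recomputation is essentially the following sketch
    of the RFC 1071 ones-complement sum (the TCP/UDP pseudo-header is
    omitted for brevity, and the names are only illustrative):
    
    #include <stddef.h>
    #include <stdint.h>
    
    /* RFC 1071 ones-complement sum over a byte buffer. */
    static uint16_t inet_checksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;
        while (len > 1) {
            sum += ((uint32_t)data[0] << 8) | data[1];
            data += 2;
            len  -= 2;
        }
        if (len)                      /* odd trailing byte, zero-padded */
            sum += (uint32_t)data[0] << 8;
        while (sum >> 16)             /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }
    
    /* "Mismatch": the recomputed sum disagrees with the checksum field
       captured from the wire ('stored'), assuming the field itself was
       zeroed in 'data' before recomputation. */
    static int is_mismatch(const uint8_t *data, size_t len, uint16_t stored)
    {
        return inet_checksum(data, len) != stored;
    }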
    
    We observed mismatch rates of roughly 1 in 4,000 (average); best-case
    around 1 in 30,000.  That's 5 or 6 orders of magnitude higher than the
    link-level error rates you cite.  By comparing the checksum mismatches
    against the TCP-level retransmissions, we were able to estimate how much
    damage occurred to the mismatches.  Keep in mind that these errors were
    caught at the TCP layer: they have already passed a link-level CRC
    check, usually the 802.3 CRC-32.  The very high observed error rate
    suggests that these errors occur outside the protection of the
    MAC-layer CRC.
    
    
    For the iSCSI analysis, a fair synopsis is that half the packets were
    so thoroughly curdled we couldn't even guess at what caused the
    damage.  There are more details in the SIGCOMM paper.
    (There, I focused more on analyzing how the standard TCP sum would fare.
    I am in the midst of recomputing total burst lengths and Hamming
    distances, for a polynomial-xor description of the errors rather
    than the `minimum edit distance'.)
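    
    (As a rough sketch of what the polynomial-xor view means -- not our
    actual analysis code -- the error "polynomial" is just the XOR of the
    sent and received copies, from which one can read off the Hamming
    weight and the overall burst span; the bit ordering here is arbitrary:)
    
    #include <stddef.h>
    #include <stdint.h>
    
    /* Characterize an error as sent XOR received: report the Hamming
       weight (number of flipped bits) and the overall burst span (bit
       positions from the first to the last flipped bit, inclusive). */
    static void error_polynomial(const uint8_t *sent, const uint8_t *rcvd,
                                 size_t len, size_t *weight, size_t *span)
    {
        size_t first = (size_t)-1, last = 0, w = 0;
        for (size_t i = 0; i < len; i++) {
            uint8_t e = sent[i] ^ rcvd[i];
            for (int b = 0; b < 8; b++) {
                if (e & (1u << b)) {
                    size_t pos = i * 8 + (size_t)b;
                    if (first == (size_t)-1)
                        first = pos;
                    last = pos;
                    w++;
                }
            }
        }
        *weight = w;
        *span   = (w == 0) ? 0 : last - first + 1;
    }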
    
    That characterization of errors is very different from the independent
    bit-error and correlated-burst-error models used in the I-D.
    
    
    I think our data supports three conclusions relevant to this
    discussion. (You may of course disagree.)
    
    First, the Internet contains a variety of error sources above the
    MAC level: between two MAC-layer interface cards inside a router, or,
    inside an end-host, between its MAC-layer card and its TCP (or SCTP,
    or UDP, or other transport protocol).
    
    Second, errors from these sources occur at rates several
    orders of magnitude higher than current link-level error rates.
    
    Third, the damage done by these error sources just does not match
    the individual-bit/single-burst model common in coding theory
    and often used to characterize link errors.
    
    While we did observe a (very) few single-bit errors and short bursts,
    we also observed a lot of much longer bursts. Approximately half the
    damaged packets were so thoroughly curdled that more than half the
    bytes were incorrect.  (We also found similar rates and patterns in
    packet traces from Vern Paxson; those are included in our SIGCOMM 2000
    paper.)
    
    It may be helpful to think of *some* of these errors as due to (for
    example) a single-bit error affecting a DMA pointer: flipping an
    address bit can cause a large change in the data stream going to or
    from the network interface. If the bit position flipped is high
    enough, it could even skip to another packet altogether.
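    
    A toy illustration of that failure mode (the numbers are purely
    hypothetical, chosen only to show how a single upset bit translates
    into wholesale damage to the data stream):
    
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    
    int main(void)
    {
        static uint8_t src[8192];      /* host memory the NIC reads from */
        uint8_t frame[1500];           /* what actually goes on the wire */
        memset(src, 0xAA, sizeof src);
    
        uint32_t offset  = 0x0100;          /* intended payload offset   */
        uint32_t corrupt = offset ^ 0x1000; /* one flipped address bit   */
    
        /* The copy now starts 4096 bytes from where it should: a single
           upset bit, and nearly every byte handed to the stack is wrong. */
        memcpy(frame, src + corrupt, sizeof frame);
        printf("intended %u, actual %u, displaced by %u bytes\n",
               offset, corrupt, corrupt - offset);
        return 0;
    }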
    
    
    It's not clear how the analysis and conclusions in your draft stand up
    if, instead of a link-level single-bit/burst error model, we
    substitute the error characteristics and rates we observed in
    `in the wild' Internet traffic -- that is, error rates some 5 or
    6 orders of magnitude higher, where the errors cause either multiple
    bursts per packet or (if modeled as a single polynomial) vary from a few
    dozen bits up to a substantial fraction of the packet length.
    
    The order-of-magnitude changes in error rate will obviously have an
    impact.  I haven't thought in detail about whether the conclusions
    about specific CRC polynomial choices hold up.
    
    
    One final point is the computational cost of software CRCs.
    
    If you buy our conclusion that the Internet contains very significant
    error sources outside of "network interface cards", then outboard
    acceleration of either checksums or CRCs is somewhat suspect:
    error checks done inside the network card simply don't cover those
    error sources.  Software CRC calculations are typically much slower than
    ones-complement, Fletcher, or Adler sums: Dave Feldmeier's
    paper suggests roughly four times slower for a full 32-bit check,
    even for generator polynomials selected to minimize nonzero
    coefficients (i.e., few taps).  For the IEEE 802 CRC, it's
    faster to do a table lookup, but that is still slower than the
    simple sums.  I don't know whether the iSCSI community has
    considered that issue, or where they/you stand on it.
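    
    For reference, the per-byte work being compared looks roughly like the
    following sketch (the constants are just the usual bit-reversed IEEE
    802.3 polynomial and the Adler-32 modulus; initial-value and final-xor
    conventions are omitted, and this is not tuned code):
    
    #include <stddef.h>
    #include <stdint.h>
    
    static uint32_t crc32_table[256];
    
    /* Build the usual 256-entry table for a reflected generator
       polynomial (0xEDB88320 is the bit-reversed IEEE 802.3 polynomial). */
    static void crc32_init(uint32_t poly)
    {
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) ? poly ^ (c >> 1) : c >> 1;
            crc32_table[i] = c;
        }
    }
    
    /* Table-driven CRC: one table lookup, one shift, two xors per byte. */
    static uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t len)
    {
        while (len--)
            crc = crc32_table[(crc ^ *p++) & 0xff] ^ (crc >> 8);
        return crc;
    }
    
    /* Adler-32-style sum: two additions per byte, modulo deferred. */
    static uint32_t adler32(const uint8_t *p, size_t len)
    {
        uint32_t a = 1, b = 0;
        for (size_t i = 0; i < len; i++) {
            a += p[i];
            b += a;
            if ((i & 0xfff) == 0xfff) {   /* fold before 32-bit overflow */
                a %= 65521;
                b %= 65521;
            }
        }
        return ((b % 65521) << 16) | (a % 65521);
    }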
    
    
    
    

