
    RE: SNACK and recovery



    > My objection to the complexity inherent in StatSN/SNACK/SACK
    > is in part motivated by the experience of those that run SCSI
    > commands over FC on high speed links.  In this FC context
    > retained status on the target is simply not supported.  But you
    > (and many on the list) know this  -- I'm not sure what to make
    > of your points above, are we just agreeing on this fact?
    
    And at the moment SNACK is not required by the iSCSI
    specification.  Such a target can choose to continue not to
    retain status and hence reject all SNACKs (although the
    result may well be that the TCP connection closes).  Whether
    we have too many error recovery options and mechanisms
    is a separate issue - as long as the SNACK mechanism
    is optional, targets that find it burdensome don't have to
    implement it.
    
    > >- Does a 16-bit TCP checksum catch enough of
    > >the corruption events to make it acceptable to
    > >take drastic measures like aborting a backup
    > >when a 32 bit CRC fails on a response that
    > >made it through the 16 bit checksum?
    > 
    > Is it correct to ignore link-level error correction?
    
    No, if the link corrects the error, it's not a corruption
    event visible to an end system because it cannot
    cause a TCP retransmit or iSCSI error recovery of
    any sort.
    
    > I tried to be clear in my opinion and its basis, but I don't
    > claim specific tape experience.  You have the paper by Stone
    > and Partridge, could we agree to a number within the range
    > that they set out.  What do you like, say 1 in 5 billion
    > packets have a TCP cksum failure?
    
    This begins an interesting math adventure.  Let me play devil's
    advocate here, and accept the 1 in 5 billion number (of
    failures undetected by the TCP checksum) to start,
    and assume that the CRC catches essentially 100%
    of those failures.  FWIW, the 1 in 5 billion rate translates
    to about twice a day for 1k packets in one direction of
    a saturated gigabit link, so if the 1 in 5 billion number
    is correct, the case for data CRCs is crystal clear
    (this is not related to Jon's point because there's a
    lot more data flowing than status).
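
    As a rough check of the "about twice a day" figure, here is a
    small sketch of the arithmetic.  The link speed (a decimal
    gigabit, 1e9 bits/s), the 1024-byte packet size, and the absence
    of any header or framing overhead are assumptions of this sketch,
    not numbers stated in the thread; the function name is
    illustrative only.

```python
# Back-of-the-envelope check: undetected TCP checksum failures per day
# in one direction of a saturated gigabit link.
# Assumptions (not from the thread): 1e9 bits/s, 1024-byte packets,
# no header or framing overhead.
LINK_BITS_PER_SEC = 1_000_000_000
PACKET_BYTES = 1024
PACKETS_PER_UNDETECTED_ERROR = 5_000_000_000  # the "1 in 5 billion" figure
SECONDS_PER_DAY = 86_400


def undetected_errors_per_day(link_bps=LINK_BITS_PER_SEC,
                              packet_bytes=PACKET_BYTES,
                              packets_per_error=PACKETS_PER_UNDETECTED_ERROR):
    """Expected number of packets per day that slip past the TCP checksum."""
    packets_per_sec = link_bps / (packet_bytes * 8)
    packets_per_day = packets_per_sec * SECONDS_PER_DAY
    return packets_per_day / packets_per_error


if __name__ == "__main__":
    # Roughly 2.1 under the assumptions above.
    print(f"{undetected_errors_per_day():.2f} undetected errors per day")
```

    Under those assumptions the link carries roughly 10.5 billion
    packets a day, which is where "about twice a day" comes from.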
    
    > Now, what to say about tapes.  Just naive conjecture on my
    > part but here goes.  Assume a 20 gig disk being backed up to
    > tape over iSCSI; what xfer size do we like for the write CDBs?
    > Would one Meg be OK for one write command?  Would that then be
    > 20480 responses covered by StatSNs to backup the 20 gig?
    
    1 Meg seems way too large.  Let's try 32K - this is
    1/32nd of 1 Meg and makes everything happen 32
    times as often.
    
    > Assuming that each response is a distinct TCP segment and
    > ignoring the fact that the corrupt data may not actually be
    > in the iSCSI header part of the TCP segment.  Then one backup
    > would fail for every 244,140 attempts.  Assuming that we do
    > the backup every day, that means we must redo the backup (for
    > this specific error case) once every 668 years (ignoring leap
    > year days).
    
    Divide by 32 and we get once every 21 years.  Now, let's try to
    back up a terabyte - that's 50 times 20 Gig and the failure
    occurs once every 5 months - that's not good, and if 150
    sites try to do this every night, on average there will
    be one failure a night.  That's often enough to be a real problem.
    If a terabyte a day seems excessive (it's not, but ...) let's
    try it once a week.  Across 1000 sites, the average is again
    around a failure a day.
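
    For anyone who wants to vary the inputs, the backup arithmetic
    above can be sketched as follows.  This assumes, as the quoted
    text does, one response per write command, each response in its
    own TCP segment, and a 1 in 5 billion undetected-failure rate.
    Reading "Meg", "Gig" and "K" as powers of two (which matches the
    20480 figure) and treating a terabyte as 50 times 20 Gig are
    choices of this sketch; the function names are illustrative only.

```python
# Expected time between backup failures caused by a response that
# passes the TCP checksum but fails the CRC.
# Assumptions: one response per write command, one TCP segment per
# response, binary units (K = 2**10, Meg = 2**20, Gig = 2**30).
PACKETS_PER_UNDETECTED_ERROR = 5_000_000_000
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def days_between_failures(backup_bytes, write_bytes, backups_per_day=1.0):
    """Average days between failed backups at a single site."""
    responses_per_backup = backup_bytes // write_bytes
    backups_per_failure = PACKETS_PER_UNDETECTED_ERROR / responses_per_backup
    return backups_per_failure / backups_per_day


def fleet_failures_per_day(sites, backup_bytes, write_bytes, backups_per_day=1.0):
    """Average failed backups per day across `sites` identical sites."""
    return sites / days_between_failures(backup_bytes, write_bytes, backups_per_day)


if __name__ == "__main__":
    terabyte = 50 * 20 * GIB  # "50 times 20 Gig"
    # 20 Gig with 1 Meg writes: about 669 years between failures
    print(days_between_failures(20 * GIB, MIB) / 365)
    # 20 Gig with 32K writes: about 21 years
    print(days_between_failures(20 * GIB, 32 * KIB) / 365)
    # a terabyte with 32K writes: about 153 days, i.e. roughly 5 months
    print(days_between_failures(terabyte, 32 * KIB))
    # 150 sites nightly, and 1000 sites weekly: both close to one a day
    print(fleet_failures_per_day(150, terabyte, 32 * KIB))
    print(fleet_failures_per_day(1000, terabyte, 32 * KIB, backups_per_day=1 / 7))
```

    The result scales linearly with every input, so a different
    error rate or transfer size simply moves the answer by the same
    factor.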
    
    I'm not claiming that my numbers are any more realistic than
    Jon's.  Does anyone on the list want to paint the "tape expert"
    target on their back and tell us where on this range the
    numbers that correspond to reality lie?
    
    > Maybe, and are there are other rare errors to consider?  Where
    > is the line drawn (or how many pages of error recovery state
    > diagrams are enough? :-)
    
    Believe it or not, that's an open issue.  There are a number of folks
    toiling away off-line on figuring out just what it takes to fully
    describe error recovery based on the current state of things (or
    some modifications that allow the task to be completed in a
    reasonable amount of time).  With luck, we'll be able to expose
    the result of their hard work to the group in the near future so
    that between the list and the Nashua meeting we can have an
    informed discussion about what is necessary in the way of
    error recovery.
    
    Thanks,
    --David
    
    ---------------------------------------------------
    David L. Black, Senior Technologist
    EMC Corporation, 42 South St., Hopkinton, MA  01748
    +1 (508) 435-1000 x75140     FAX: +1 (508) 497-8500
    black_david@emc.com       Mobile: +1 (978) 394-7754
    ---------------------------------------------------
    
    



Last updated: Tue Sep 04 01:05:09 2001