    Re: SNACK and recovery



    David,
    
    Since you addressed this to me, I'll reply, but be forewarned
    I really have nothing more to add to this thread :-).
    
    Black_David writes:
    >This turns out to be a matter not just of rarity,
    >but also one of consequences.  As Mark points
    >out, for tapes and similar devices, the consequences
    >are disastrous - the backup aborts, and when
    >"those in charge" come in the next morning,
    >they have no usable backup tape, and are very
    >unhappy.  While Jon says "streams devices must
    >support abort and retry for extreme errors in
    >any case", the abort may well be the entire
    >backup and the retry might be next weekend ...
    >not a good situation.
    
    Perhaps, but could you describe the process that supports this
    scenario?  A tape backup procedure that must succeed entirely,
    and, if not, may only be repeated the following weekend -- seems
    like it would be hard to sustain (quite apart from this discussion).
    
    >Over in Fibre Channel world, FCP-2 contains
    >recovery support that resulted from the
    >discovery that despite the fact that non-
    >delivery of a Fibre Channel frame (Class 2 or
    >3 - it doesn't matter which) is "extremely
    >rare":
    >- Buffer overrun is prevented by both link
    >	and end-to-end buffer usage controls.
    >- FC switches are engineered to not drop
    >	frames to the maximum extent possible
    >	due in part to these consequences.
    >- There's a 32-bit CRC covering the entire
    >	FC frame.
    >failure to deliver a frame happens often enough
    >that a recovery mechanism is needed to avoid
    >tape backup aborts and the like.  Unlike TCP,
    >Fibre Channel has no built-in retransmit
    >mechanism.
    
    My objection to the complexity inherent in StatSN/SNACK/SACK
    is in part motivated by the experience of those who run SCSI
    commands over FC on high speed links.  In this FC context,
    retained status on the target is simply not supported.  But you
    (and many on the list) know this -- I'm not sure what to make
    of your points above.  Are we just agreeing on this fact?
    
    >In contrast to Fibre Channel, we are dealing
    >with something rarer because TCP retransmit will
    >take care of most things that can go wrong in
    >switches and there's a 16 bit checksum whose
    >failure will trigger retransmits.  What this
    >appears to come down to is:
    >
    >- Does a 16-bit TCP checksum catch enough of
    >the corruption events to make it acceptable to
    >take drastic measures like aborting a backup
    >when a 32 bit CRC fails on a response that
    >made it through the 16 bit checksum?
    
    Is it correct to ignore link-level error correction?
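
    One way to see why the 16-bit checksum and a 32-bit CRC are
    treated differently in the question above: the TCP checksum is a
    ones' complement sum of 16-bit words, so it cannot notice two
    words changing places, while a CRC can.  The sketch below is only
    an illustration under assumptions: it omits the TCP pseudo-header,
    the sample bytes are made up, and zlib.crc32 stands in for
    whichever 32-bit CRC the protocol ends up using.

```python
import struct
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style TCP uses."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


# Made-up payload; the corruption swaps the first two 16-bit words.
original = b"\x12\x34\x56\x78\x9a\xbc\xde\xf0"
swapped = original[2:4] + original[0:2] + original[4:]

# Addition is commutative, so the checksum cannot see the swap.
same_checksum = internet_checksum(original) == internet_checksum(swapped)
# The CRC depends on bit positions, so it does see the swap.
same_crc = zlib.crc32(original) == zlib.crc32(swapped)

print("checksum unchanged:", same_checksum)
print("crc unchanged:", same_crc)
```

    This shows one class of corruption the checksum misses; it says
    nothing about how often such corruption occurs in practice.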
    
    >The discussion's been a bit convoluted.  Some
    >simple yes/no answers to the above question
    >accompanied by short reasoning would be appreciated.
    >I think Julian's said "no" and quoted a filesystem
    >number that we're awaiting a reference to.
    
    I tried to be clear about my opinion and its basis, but I don't
    claim specific tape experience.  You have the paper by Stone
    and Partridge; could we agree on a number within the range
    that they set out?  What do you like: say, 1 in 5 billion
    packets carries corruption that the TCP cksum fails to catch?
    
    Now, what to say about tapes.  Just naive conjecture on my
    part, but here goes.  Assume a 20 gig disk being backed up to
    tape over iSCSI; what xfer size do we like for the write CDBs?
    Would one Meg be OK for one write command?  That would then be
    20480 responses covered by StatSNs to back up the 20 gig.
    
    Assume that each response is a distinct TCP segment, and
    ignore the fact that the corrupt data may not actually be
    in the iSCSI header part of the TCP segment.  Then one backup
    would fail for every 244,140 attempts.  Assuming that we do
    the backup every day, that means we must redo the backup (for
    this specific error case) once every 668 years (ignoring leap
    year days).
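
    The arithmetic above can be reproduced with a short sketch.  The
    inputs are the ones assumed in this message (20 gig disk, one Meg
    per write, 1 in 5 billion packets, one backup per day); the
    variable names are only illustrative, and the calculation uses the
    same simple division as the text rather than an exact probability.

```python
# Inputs as assumed in the message, not measured values.
DISK_BYTES = 20 * 1024**3          # 20 gig disk
WRITE_BYTES = 1024**2              # one Meg per write command
PACKETS_PER_ERROR = 5_000_000_000  # 1 in 5 billion packets slips past TCP

# One response (one StatSN) per write command.
responses_per_backup = DISK_BYTES // WRITE_BYTES

# Expected number of backups between failures, one segment per response.
backups_per_failure = PACKETS_PER_ERROR // responses_per_backup

# One backup per day, ignoring leap year days.
years_per_failure = backups_per_failure // 365

print(responses_per_backup)   # 20480
print(backups_per_failure)    # 244140
print(years_per_failure)      # 668
```

    Changing the transfer size or the error rate shifts the result
    proportionally, which is the adjustment invited below.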
    
    [ Note, the error is detected, no corrupted data has gone
     unrecognized, the downside is that the backup must be redone. ]
    
    Now, I don't credit my tape assumptions (though I hope they
    are generous wrt the counter argument) -- those who know tape
    processes and iSCSI flows should adjust them.  But, if this
    scenario is in the ballpark, then "yes" seems to be the
    answer to your question.
    
    >Just to muddy the waters further, let me point out
    >that tape targets tend to be less complex than
    >disk targets.  Tapes don't reorder commands, and
    >often don't even queue them.  Saving the last N
    >responses is not that difficult when the responses
    >go out in the order that the commands came in
    >(easier to organize saving them), and the initiator
    >has to be very careful about the number of commands
    >in flight to avoid disasters caused by dropped
    >commands (should lead to reasonable results from
    >relatively small values of N).
    
    Maybe, but are there other rare errors to consider?  Where
    is the line drawn (or how many pages of error recovery state
    diagrams are enough? :-)
    
    -Jon
    


Last updated: Tue Sep 04 01:05:09 2001
6315 messages in chronological order