RE: SNACK and recovery

To: jhall@emc.com, ips@ece.cmu.edu
Subject: RE: SNACK and recovery
From: Black_David@emc.com
Date: Fri, 6 Apr 2001 21:20:39 -0400
Content-Type: text/plain;charset="iso-8859-1"
Sender: owner-ips@ece.cmu.edu

> My objection to the complexity inherent in StatSN/SNACK/SACK
> is in part motivated by the experience of those that run SCSI
> commands over FC on high speed links.  In this FC context
> retained status on the target is simply not supported.  But you
> (and many on the list) know this  -- I'm not sure what to make
> of your points above, are we just agreeing on this fact?

And at the moment SNACK is not required by the iSCSI
specification.  Such a target can choose to continue not to
retain status and hence reject all SNACKs (although the
result may well be that the TCP connection closes).  Whether
we have too many error recovery options and mechanisms
is a separate issue - as long as the SNACK mechanism
is optional, targets that find it burdensome don't have to
implement it.

> >- Does a 16-bit TCP checksum catch enough of
> >the corruption events to make it acceptable to
> >take drastic measures like aborting a backup
> >when a 32 bit CRC fails on a response that
> >made it through the 16 bit checksum?
> 
> Is it correct to ignore link-level error correction?

No, if the link corrects the error, it's not a corruption
event visible to an end system because it cannot
cause a TCP retransmit or iSCSI error recovery of
any sort.

> I tried to be clear in my opinion and its basis, but I don't
> claim specific tape experience.  You have the paper by Stone
> and Partridge, could we agree to a number within the range
> that they set out.  What do you like, say 1 in 5 billion
> packets have a TCP cksum failure?

This begins an interesting math adventure.  Let me play devil's
advocate here, and accept the 1 in 5 billion number (of
failures undetected by the TCP checksum) to start,
and assume that the CRC catches essentially 100%
of those failures.  FWIW, the 1 in 5 billion rate translates
to about twice a day for 1k packets in one direction of
a saturated gigabit link, so if the 1 in 5 billion number
is correct, the case for data CRCs is crystal clear
(this is not related to Jon's point because there's a
lot more data flowing than status).
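
The "about twice a day" figure can be checked with a short sketch.
It assumes a gigabit means 10^9 bits/sec, that a 1k packet is 1024
bytes, and that header and framing overhead can be ignored; none of
these assumptions is stated in the message, and the function name is
made up for illustration.

```python
# Rough check of the "twice a day" estimate.
# Assumptions (not from the original message): 10^9 bits/sec link,
# 1024-byte packets, no header or framing overhead, link fully saturated.
LINK_BITS_PER_SECOND = 1_000_000_000
PACKET_BYTES = 1024
SECONDS_PER_DAY = 86_400
ONE_IN = 5_000_000_000  # packets per failure missed by the TCP checksum


def undetected_failures_per_day(link_bps=LINK_BITS_PER_SECOND,
                                packet_bytes=PACKET_BYTES,
                                one_in=ONE_IN):
    """Expected undetected-by-TCP failures per day, one direction of the link."""
    packets_per_second = link_bps / (packet_bytes * 8)
    packets_per_day = packets_per_second * SECONDS_PER_DAY
    return packets_per_day / one_in


print(f"{undetected_failures_per_day():.2f} failures per day")
```

Under those assumptions the link carries a little over 10 billion
packets a day in one direction, which is where the factor of two comes
from.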

> Now, what to say about tapes.  Just naive conjecture on my
> part but here goes.  Assume a 20 gig disk being backed up to
> tape over iSCSI; what xfer size do we like for the write CDBs?
> Would one Meg be OK for one write command?  Would that then be
> 20480 responses covered by StatSNs to backup the 20 gig?

1 Meg seems way too large.  Let's try 32K - this is
1/32nd of 1 Meg and makes everything happen 32
times as often.

> Assuming that each response is a distinct TCP segment and
> ignoring the fact that the corrupt data may not actually be
> in the iSCSI header part of the TCP segment.  Then one backup
> would fail for every 244,140 attempts.  Assuming that we do
> the backup every day, that means we must redo the backup (for
> this specific error case) once every 668 years (ignoring leap
> year days).

Divide by 32 and we get once every 21 years.  Now, let's try to
back up a terabyte - that's 50 times 20 Gig and the failure
occurs once every 5 months - that's not good, but if 150
sites try to do this every night, on average there will
be one failure a night.  That's often enough to be a real problem.
If a terabyte a day seems excessive (it's not, but ...) let's
try it once a week.  Across 1000 sites, the average is again
around a failure a day.
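
The chain of estimates above can be reproduced with a small sketch.
It assumes binary units (the 20480 figure in the quoted text implies
20 gig / 1 Meg in powers of two), one response per write command, and
a terabyte taken as 50 times 20 gig as in the text.  The function and
variable names are invented for illustration.

```python
# Sketch of the backup-failure arithmetic.
# Assumptions: binary units, one status response per write command,
# each response fails undetected by TCP with probability 1 in ONE_IN.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
ONE_IN = 5_000_000_000


def backups_between_failures(backup_bytes, transfer_bytes, one_in=ONE_IN):
    """Expected number of backups per failed backup."""
    responses = backup_bytes // transfer_bytes
    return one_in / responses


def fleet_failures_per_day(sites, backups_per_site_per_day, between):
    """Expected failed backups per day across a number of sites."""
    return sites * backups_per_site_per_day / between


jon = backups_between_failures(20 * GIB, 1 * MIB)         # 1 Meg writes
small = backups_between_failures(20 * GIB, 32 * KIB)      # 32K writes
tb = backups_between_failures(50 * 20 * GIB, 32 * KIB)    # "terabyte", 32K writes

print(f"20 gig, 1 Meg writes, nightly: one failure per {jon / 365:.1f} years")
print(f"20 gig, 32K writes, nightly:   one failure per {small / 365:.1f} years")
print(f"terabyte, 32K writes, nightly: one failure per {tb:.1f} days")
print(f"150 sites nightly:  {fleet_failures_per_day(150, 1.0, tb):.2f} failures/day")
print(f"1000 sites weekly:  {fleet_failures_per_day(1000, 1 / 7, tb):.2f} failures/day")
```

With these inputs the intervals come out at roughly 669 years, 21
years, and 153 days (about 5 months), and both multi-site cases land
just under one failure a day.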

I'm not claiming that my numbers are any more realistic than
Jon's.  Does anyone on the list want to paint the "tape expert"
target on their back and tell us where on this range the
numbers that correspond to reality lie?

> Maybe, and are there are other rare errors to consider?  Where
> is the line drawn (or how many pages of error recovery state
> diagrams are enough? :-)

Believe it or not, that's an open issue.  There are a number of folks
toiling away off-line on figuring out just what it takes to fully
describe error recovery based on the current state of things (or
some modifications that allow the task to be completed in a
reasonable amount of time).  With luck, we'll be able to expose
the result of their hard work to the group in the near future so
that between the list and the Nashua meeting we can have an
informed discussion about what is necessary in the way of
error recovery.

Thanks,
--David

---------------------------------------------------
David L. Black, Senior Technologist
EMC Corporation, 42 South St., Hopkinton, MA  01748
+1 (508) 435-1000 x75140     FAX: +1 (508) 497-8500
black_david@emc.com       Mobile: +1 (978) 394-7754
---------------------------------------------------
