RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"

To: "'Stephen Bailey'" <steph@cs.uchicago.edu>, ips@ece.cmu.edu
Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
From: Venkat Rangan <venkat@rhapsodynetworks.com>
Date: Mon, 9 Apr 2001 14:53:23 -0700
Content-Type: text/plain;charset="iso-8859-1"
Sender: owner-ips@ece.cmu.edu

Steph,

Not to beat a dead horse, but the reason link-level CRCs may not be of much
help is as follows.

The paper "When the CRC and TCP Checksum Disagree" section 5.1 describes the
data transmission path and potential for error introduction at various
points in the path.

At a layer 3 device you have:

1. The existing link-level CRC verified and stripped.

2. The payload (IP packet) DMA'ed into some buffers, preserving the original
IP header checksums and TCP checksums.

3. Create a new link-level header.

4. Compute a new CRC.

5. Data sent to the next hop.

If an error is introduced (by software or hardware) in step 2 or 3, the new
CRC computed in step 4 isn't of any help, because it is computed over the
already-corrupted data. The introduced error can be:

1. In the IP header (such as munged IP address bytes).

2. In the TCP header (such as the port got corrupted).

3. In the TCP checksum itself.

4. In the payload.
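
To make the exposure concrete, here is a minimal sketch of the forwarding
steps above. It is only an illustration: zlib's CRC-32 stands in for the
link-level CRC, the link header is omitted, and frame(), crc_ok() and
forward() are hypothetical names, not anything from a real device.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a link-level CRC (CRC-32 used as a stand-in) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def crc_ok(framed: bytes) -> bool:
    """Check the trailing 4-byte CRC against the payload in front of it."""
    payload, crc = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc

def forward(framed: bytes, corrupt_in_buffer: bool = False) -> bytes:
    """Model one layer 3 hop (link header handling omitted)."""
    # Step 1: verify and strip the incoming link-level CRC.
    if not crc_ok(framed):
        raise ValueError("bad link CRC")
    payload = bytearray(framed[:-4])
    # Step 2: the payload sits in device buffers with no link CRC over it.
    if corrupt_in_buffer:
        payload[-1] ^= 0x01  # assumed single-bit error in the buffer
    # Steps 3-5: a fresh CRC is computed over whatever is in the buffer.
    return frame(bytes(payload))

original = b"iSCSI data segment"
out = forward(frame(original), corrupt_in_buffer=True)
link_crc_valid = crc_ok(out)          # the next hop sees a valid CRC
payload_intact = out[:-4] == original  # but the payload has changed
```

The next hop accepts the frame because the new CRC is valid, even though the
payload no longer matches what was sent.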

Error categories 1 and 2 may cause the packet not to be delivered at all. It
is okay if we do not detect these, because such packets are never delivered
to the iSCSI processing layer. Error 3 would cause the packet to be rejected.
Error 4 should normally be caught by the TCP checksum, but at an escape rate
of 1 in 10e8 it escapes detection. (Actually, given the error bias toward the
headers, I'm not sure whether this rate is the rate within the payload of a
TCP segment.) The iSCSI header and data digests are present to detect that
escape.
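
As a rough illustration of a payload error that escapes the TCP checksum but
not a CRC-style digest, consider the sketch below. It assumes a plain 16-bit
ones' complement sum over the payload only (no pseudo-header), and uses
zlib's CRC-32 as a stand-in for the iSCSI digest; internet_checksum() is a
hypothetical helper name.

```python
import zlib

def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum of the kind TCP uses."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

good = b"\x12\x34\x56\x78\x9a\xbc"
bad = b"\x56\x78\x12\x34\x9a\xbc"  # first two 16-bit words swapped

# The sum does not depend on word order, so the checksum cannot see the swap.
checksum_missed = internet_checksum(good) == internet_checksum(bad)
# A CRC depends on position, so the digest differs.
digest_caught = zlib.crc32(good) != zlib.crc32(bad)
```

Reordered 16-bit words are only one class of error the checksum misses; the
point is simply that the digest is a different kind of check and does not
share this blind spot.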

In the presence of middle boxes that do more than layer 2 forwarding (say, a
box that terminates a TCP connection and initiates a new one), if the middle
box retains the iSCSI header and data digests but only computes a new
checksum, the transmission path exposure is similar to steps 2 and 3 above.
The header and data digests will enable detection of that.

If the middle box does more than just terminate TCP connections, say it
changes the iSCSI header, recomputes the iSCSI header digest, and leaves the
data digest alone, then at least the data part is protected, but not the
header. If it changes both header and data, there is no protection. In order
to get true end-to-end protection, the application needs to apply a separate
digest, such as creating a 516-byte data block for every 512-byte sector of
data and storing that on the media.
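
A minimal sketch of that 516-byte block idea follows. The choice of CRC-32
for the 4 extra bytes is an assumption made for illustration (the digest
could be anything the application picks), and protect() and verify() are
hypothetical names.

```python
import zlib

SECTOR = 512  # bytes of user data per sector
TAG = 4       # bytes of application-level digest appended to each sector

def protect(sector: bytes) -> bytes:
    """Return a 516-byte block: the 512-byte sector plus a 4-byte digest."""
    if len(sector) != SECTOR:
        raise ValueError("sector must be exactly 512 bytes")
    return sector + zlib.crc32(sector).to_bytes(TAG, "big")

def verify(block: bytes) -> bytes:
    """Check the digest of a 516-byte block and return the 512-byte sector."""
    if len(block) != SECTOR + TAG:
        raise ValueError("block must be exactly 516 bytes")
    sector, tag = block[:SECTOR], block[SECTOR:]
    if zlib.crc32(sector).to_bytes(TAG, "big") != tag:
        raise ValueError("sector digest mismatch")
    return sector

sector = bytes(range(256)) * 2  # 512 bytes of sample data
block = protect(sector)         # what would be written to the media

# Simulate a single bit flipped somewhere between application and media.
damaged = bytearray(block)
damaged[100] ^= 0x01
try:
    verify(bytes(damaged))
    detected = False
except ValueError:
    detected = True
```

Because the digest is applied and checked by the application itself, it is
unaffected by whatever any middle box recomputes along the way.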

So, the escape rate depends quite a bit on the number of middle boxes and the
exposure of data paths. How much do we rely on middle boxes to never
introduce an error during the exposure? Since the referenced papers suggest
correct end-to-end delivery of TCP segments with checksum errors in them,
the presence of exposed paths in the middle boxes has been a factor. Still,
the rates quoted (1 in 200 million or 1 in 300 million) suggest that it is
necessary to have very strong CRC and detection mechanisms, but that it may
not be necessary to optimize the recovery options so that we are able to
recover with the smallest amount of retransmission of data.

I haven't studied the two other references on the subject, but again I
suspect there is evidence to suggest that errors will creep in at
intermediate processing elements.

Venkat Rangan
Rhapsody Networks Inc.
http://www.rhapsodynetworks.com


-----Original Message-----
From: Stephen Bailey [mailto:steph@cs.uchicago.edu]
Sent: Monday, April 09, 2001 11:57 AM
To: ips@ece.cmu.edu
Subject: Re: iSCSI ERT: data SACK/replay buffer/"semi-transport" 


> Exactly, I've worked in this context (though it's been some years now).
> It was true (at one time) that tape had a tractability limit, e.g.,
> a tape backup of a terabyte was out of the question.  Has that changed?

I think this is precisely the point.  Existing, off-the-shelf SCSI
solutions DO NOT presently solve this problem.  Both ||SCSI and FCP
burp the operation on an expectable, O(days) failure rate.  The rate of
adoption for the FCP-2 command recovery feature is overwhelming to the
point that the tape guys have been talking about end-running the
problem with explicitly addressed commands.

What we have running iSCSI on TCP is such a drastic improvement in
what you can expect from your SCSI service that we can eventually
expect a disruptive change.  Trying to engineer it to the point where
it's 2^100 times more disruptive, when we don't really know where it's
taking us in the first place, is meaningless.

[Warning: repetition ahead]

TCP + link layer error detection is engineered precisely to ensure
reliable data delivery.  It's clear from an engineering standpoint
that it is likely (not guaranteed, what is?) to do this quite well.
In spite of much research, it seems like nobody here has come up with
a strong indication that TCP + link layer error detection does NOT do
its job well.  I do not think this is because nobody has ever looked
at the problem.

The lack of concrete information to support the case that TCP + link
layer error detection is inadequate has us chasing our tails.

Given the layer iSCSI occupies in the protocol layer cake, if we don't
try to solve a problem which is presently assigned to a lower layer, it
seems quite comfortable to shim additional checks or recovery, or even a
completely different transport substrate, underneath if we do discover
TCP + link layer error detection is not doing the trick, but it really
seems like folly to engineer based upon an assumption that nobody has
done a good job documenting.

Steph
