
    Re: iSCSI draft 02: digests



    Randall-
    
    I can think of two cases where the TCP checksum can be OK,
    but the stronger digest will fail.  I would expect both to
    happen in a bridge, router, or gateway where at some level
    the layer-2 CRC is removed and regenerated on the other
    side of the box for a given TCP segment.
    
    1) Data could be randomly corrupted between the interfaces
    (e.g. over a bus or switch) in the router.  I've seen this
    in past lives building bridges and routers, usually due to
    a bad memory location.
    
    2) Data could be corrupted due to a bit pattern-related bug
    in the bridge or router.
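
    To illustrate how either kind of corruption can slip past TCP,
    here is a minimal sketch.  It assumes the 16-bit one's-complement
    sum from RFC 1071 for the TCP side and uses zlib.crc32 purely as
    a stand-in for "the stronger digest"; the byte values and the two
    corruption patterns are made up for the example.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x12\x34\x56\x78\x9a\xbc\xde\xf0"

# Hypothetical corruption A: two 16-bit words swapped inside the box.
swapped = original[2:4] + original[0:2] + original[4:]

# Hypothetical corruption B: one word goes up by 1, another down by 1.
compensated = b"\x12\x35\x56\x77" + original[4:]

for name, bad in (("swapped", swapped), ("compensated", compensated)):
    tcp_ok = internet_checksum(bad) == internet_checksum(original)
    crc_ok = zlib.crc32(bad) == zlib.crc32(original)
    print(f"{name}: TCP checksum unchanged={tcp_ok}, CRC-32 unchanged={crc_ok}")
```

    Because the one's-complement sum is order-independent, both
    patterns leave the TCP checksum unchanged, while the CRC differs.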
    
    In case (1), simply resending the PDU would in theory work,
    since the same random corruption is unlikely to recur and
    again pass the TCP checksum.  However, we don't know whether
    the corruption hit some data or a length field, so resending
    the PDU on the same connection is not really an option.
    Tearing down the old connection and building a new one
    (keeping the session if possible) is easier than trying to
    find PDU boundaries and re-synchronize the connection, and
    searching for boundaries may lose PDUs in between anyway.
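
    A rough sketch of why a corrupted length field rules out staying
    on the same connection.  The framing here (a 2-byte big-endian
    length followed by the payload) is a simplified assumption for
    the example, not the iSCSI header layout, and the payload strings
    are invented.

```python
import struct
from typing import List


def frame(payload: bytes) -> bytes:
    """Prefix a payload with a 2-byte big-endian length (toy framing)."""
    return struct.pack(">H", len(payload)) + payload


def parse(stream: bytes) -> List[bytes]:
    """Split a byte stream back into payloads using the length prefixes."""
    pdus = []
    pos = 0
    while pos + 2 <= len(stream):
        (length,) = struct.unpack_from(">H", stream, pos)
        pos += 2
        pdus.append(stream[pos:pos + length])
        pos += length
    return pdus


payloads = [b"READ-0001", b"WRITE-0002", b"STATUS-03"]
stream = b"".join(frame(p) for p in payloads)

# Flip a single bit in the first length field (9 becomes 13).
corrupted = bytearray(stream)
corrupted[1] ^= 0x04
damaged = parse(bytes(corrupted))

print(parse(stream) == payloads)  # clean stream parses back to the originals
print(damaged)                    # every PDU after the bad length is misframed
```

    One flipped bit in a length field misframes not just that PDU
    but everything after it on the stream, which is why starting
    over on a fresh connection is the simpler recovery.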
    
    In case (2), if a particular bit pattern always causes corruption,
    and that bit pattern is in something that doesn't change with
    the re-send (like the SCSI data), we can keep trying forever, and
    it will likely keep failing.  In this case, the equipment in
    between has evidently not been tested adequately, and in any case
    needs to be replaced.  It's better to keep retrying a connection
    than to have written bad data and not informed anyone.
    
    Also in case (2), if the bit pattern causing the problem or bug
    to show up was in the TCP header, building a new connection is
    likely to solve the problem, since most of the fields in this
    header will be different (source port, sequence, and ack fields)
    for a new connection.
    
    Anyway, since we are implementing at the iSCSI application layer,
    and not within TCP, we have no way to re-transmit a bad segment
    due to our own CRC errors, since TCP considered the segment to be
    just fine, and by the time we have the data, we probably can't even
    find out where the segment boundaries originally were, or which
    segment within the data was corrupted.
    
    Given that re-synchronizing a connection would be tricky and
    troublesome, I think that establishing a new connection when this
    happens is our only choice, and has a decent probability of
    solving the problem in at least some cases.  If there are worse
    problems, none of these methods will fix them anyway, and in that
    case, our job is to make sure that no bad data is delivered and
    that data is not written to the wrong location.  An implementor
    should also have a way to count these errors and alert someone
    if session recovery does not work.
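
    As a sketch of what counting these errors might look like: the
    class below is entirely hypothetical (its name, the threshold,
    and the recover_session callback are assumptions, not anything
    from the draft).  It only records the policy described above:
    never reuse the old connection, try session recovery on a new
    one, and raise an alert when recovery keeps failing.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DigestErrorMonitor:
    """Hypothetical per-session bookkeeping for digest failures."""
    alert_threshold: int = 3           # assumed value, tune per deployment
    digest_errors: int = 0             # total digest failures seen
    failed_recoveries: int = 0         # consecutive failed recoveries
    alerts: List[str] = field(default_factory=list)

    def on_digest_failure(self, recover_session: Callable[[], bool]) -> str:
        """Handle one digest failure.

        recover_session is assumed to drop the old connection, open a
        new one, and re-send the outstanding PDUs, returning True on
        success.
        """
        self.digest_errors += 1
        if recover_session():
            self.failed_recoveries = 0
            return "recovered"
        self.failed_recoveries += 1
        if self.failed_recoveries >= self.alert_threshold:
            self.alerts.append(
                f"session recovery failed {self.failed_recoveries} times in a row"
            )
            return "alert"
        return "retry"


monitor = DigestErrorMonitor(alert_threshold=2)
outcomes = [
    monitor.on_digest_failure(lambda: False),
    monitor.on_digest_failure(lambda: False),
    monitor.on_digest_failure(lambda: True),
]
print(outcomes, monitor.digest_errors, monitor.alerts)
```

    The total count is kept even after a successful recovery, so a
    path that corrupts data only occasionally still shows up.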
    
    --
    Mark
    
    "Randall R. Stewart" wrote:
    > 
    > Mark/Julian:
    > 
    > Mark Bakke wrote:
    > >
    > > Yes, the connection should not be recovered, but the iSCSI session
    > > can be.
    > 
    > I have thought on this for a bit, and it seems to me one must
    > have a look at what went wrong if a digest fails and yet TCP
    > still delivered the packet. Was it TCP's fault? Well in some
    > ways one could answer yes... since what happened is a sequence
    > of bit errors was somewhere introduced to the IP packets that
    > caused:
    > 
    > A) TCP's (and IP's) checksum to still pass
    > and
    > B) The stronger digest protection detected the error.
    > 
    > Now then, the question is how can we fix the problem?
    > I think that you must get a new copy of the bad segment
    > to the receiver... Herein lies the question: what will
    > make it so we can do so:
    > 
    > Will restarting the TCP connection (or keeping the same
    > one for that matter), make any difference? I don't think
    > so... it was something in the network that caused this
    > error to occur, if it is a fluke then retransmitting the
    > packet on the same or a new connection will not make
    > a bit of difference ... since it is the network that
    > corrupted the packet. If it is not a fluke random chance, then
    > you have a more serious network (or TCP stack) problem and
    > all the restarts in the world are not going to make the packet
    > go through since the network or stack will just keep re-corrupting
    > the packet...
    > 
    > Bottom line is I am not convinced resetting the connection will
    > gain you anything.. I don't think it will hurt .. but I don't
    > see you gaining anything...
    > 
    > R
    > 
    > >
    > > --
    > > Mark
    > >
    > > julian_satran@il.ibm.com wrote:
    > > >
    > > > Mark,
    > > >
    > > > I also gave it some more thought.
    > > >
    > > > Since a digest failure is a transport failure that went undetected by TCP,
    > > > dropping and restarting a connection won't do us too much good - if we use
    > > > the same link we may end up having some more.
    > > >
    > > > We should treat them as iSCSI failures and have iSCSI restart the command
    > > > without restarting the connection.
    > > >
    > > > Regards,
    > > > Julo
    > > >
    > > > Mark Bakke <mbakke@cisco.com> on 05/12/2000 00:18:28
    > > >
    > > > Please respond to Mark Bakke <mbakke@cisco.com>
    > > >
    > > > To:   Julian Satran/Haifa/IBM@IBMIL
    > > > cc:   ips@ece.cmu.edu
    > > > Subject:  Re: iSCSI draft 02: digests
    > > >
    > > > Julian-
    > > >
    > > > Here's what we had in mind for recovering from digest/CRC failures:
    > > >
    > > > 1. If the digest failure is on a command, status, or iSCSI header,
    > > >    this means that a length field could be corrupted.  This should
    > > >    not happen often, but it may be possible to re-send the command
    > > >    if both the initiator and target can do session recovery as in
    > > >    the iSCSI spec.  In any case, the connection should be terminated,
    > > >    and a new one built in its place.  If session recovery is supported
    > > >    and is successful, the missing iSCSI PDU(s) during and after the
    > > >    digest failure are re-sent, re-responded, and no harm done.  If
    > > >    session recovery fails, the upper SCSI layer must receive the
    > > >    failure, and do whatever recovery is necessary.  In any case, the
    > > >    old connection should not be used after the failure.
    > > >
    > > > 2. If the digest failure is on a SCSI data block, iSCSI length fields
    > > >    are not affected, so there may be a possible way to resend the
    > > >    data.  However, doing this is probably not worthwhile, so I think
    > > >    that in the data digest case, the same recovery as in (1) should
    > > >    be used.
    > > >
    > > > --
    > > > Mark
    > > >
    > > > julian_satran@il.ibm.com wrote:
    > > > >
    > > > > Like on a data failure on any bus. Raise a check condition and end the
    > > > > command with an error but let it go up to
    > > > > the normal end.  I will spec it.
    > > > >
    > > > > Thanks,
    > > > > Julo
    > > > >
    > > > > Matt Wakeley <matt_wakeley@agilent.com> on 29/11/2000 01:39:22
    > > > >
    > > > > Please respond to Matt Wakeley <matt_wakeley@agilent.com>
    > > > >
    > > > > To:   ips@ece.cmu.edu
    > > > > cc:
    > > > > Subject:  iSCSI draft 02: digests
    > > > >
    > > > > In appendix A is a (brief) description of the iSCSI header and data
    > > > > digests.
    > > > >
    > > > > What is the expected behavior if there is a digest failure?  Just throw the
    > > > > PDU away?
    > > > >
    > > > > -Matt
    > > >
    > > > --
    > > > Mark A. Bakke
    > > > Cisco Systems
    > > > mbakke@cisco.com
    > > > 763.398.1054
    > >
    > > --
    > > Mark A. Bakke
    > > Cisco Systems
    > > mbakke@cisco.com
    > > 763.398.1054
    > 
    > --
    > Randall R. Stewart
    > randall@stewart.chicago.il.us or rrs@cisco.com
    > 815-342-5222 (cell) 815-477-2127 (work)
    
    -- 
    Mark A. Bakke
    Cisco Systems
    mbakke@cisco.com
    763.398.1054
    



Last updated: Tue Sep 04 01:06:11 2001
6315 messages in chronological order