
    Some thoughts on synchronization



    
    After watching the synchronization discussion for a
    few days, I'm wondering if this isn't starting to get
    too complicated for our own good.  Basically, iSCSI
    uses TCP connections in almost exactly the same way
    as HTTP/1.1 and CIFS.
    
    The only ways I could come up with where sync would
    be lost are:
    
    1. A bit error happens in a length field (or another
    field controlling the length of a header or data),
    and is not caught by a data-link layer CRC or by the
    TCP checksum.  If an error is detected at either of
    these levels, the layer-2 frame or TCP segment is
    discarded and the data is retransmitted, so detected
    errors do not cause data integrity or iSCSI framing
    problems.
    
    2. The initiator is not formatting messages correctly,
    or is expecting incorrect message formats.
    
    3. The target is not formatting messages correctly,
    or is expecting incorrect message formats.
    
    2 and 3 have to be solved by interoperability testing
    anyway; detecting these problems should cause some
    sort of warning via the initiator's and/or target's
    management interfaces.
    
    
    An undetected error can happen in a few ways:
    
    - A set of bit errors within a layer-2 network happens
    that makes an L2 frame still have the same CRC, and
    at the same time makes the overall TCP segment have the
    same checksum.  This is extremely unlikely, but possible.
    
    - Bit errors happen on a router or other device that
    strips and regenerates a layer-2 CRC, and at the same
    time keeps the same TCP segment checksum.  This is
    probably more likely than the previous case, since
    only one of the checks has to fail in its duty.
    
    - Bit errors happen in a bus, switched fabric, or buffer
    memory in either the initiator or target, without being
    detected by whatever scheme (parity, ECC, etc.) is in
    use.  This will likely have a probability similar to
    the bit-error-in-a-router case.
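    
    To make the "same checksum" case concrete, here is a
    small sketch in Python.  It assumes the 16-bit ones'
    complement sum that TCP uses (as described in RFC 1071);
    the function name and the sample bytes are made up for
    illustration.  Two bit flips in the same bit position of
    two different 16-bit words, in opposite directions,
    cancel out in the sum, so the checksum cannot see them.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical four-byte fragment: 16-bit words 0x1234 and 0xab00.
original = bytes.fromhex("1234ab00")

# Flip bit 0x04 in the low byte of each word: it goes 1 -> 0 in the
# first word and 0 -> 1 in the second, so the two changes cancel.
corrupted = bytearray(original)
corrupted[1] ^= 0x04
corrupted[3] ^= 0x04
corrupted = bytes(corrupted)

same_checksum = internet_checksum(original) == internet_checksum(corrupted)
```

    Here the data differs but same_checksum is true, so only
    the layer-2 CRC (if it has not been stripped and
    regenerated along the way) stands between this error and
    the iSCSI layer.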
    
    These are all cases that should rarely happen, although
    their probabilities depend heavily on the equipment and
    services purchased by the end customer.  However, they
    can happen, and in some cases it's important enough
    to deal with the possibility.
    
    If we end up with these types of bit errors, corrupted
    length fields and broken iSCSI framing are probably the
    least of (or at least just one of) our worries.  For
    example, a read command could become a write, the wrong
    data could be written to a database block, the block
    address being read or written could be wrong, or ...
    
    Anyway, ALL of these cases are equally bad.
    
    I see a few things that are reasonable to do.
    
    1. If either the initiator or target detects a framing
    problem (length fields out-of-range, unexpected values
    in a particular field, etc), it should proceed to shut
    down the connection, and allow the initiator-based
    connection recovery to take place and initiate a new
    connection in its place.  Since these types of errors
    should not happen more than once in a while (once per
    day seems like a lot of undetected errors), connection
    recovery will make use of a mechanism we already have,
    instead of adding another one.  If a framing problem is
    detected in this way, chances are that the previous
    iSCSI header was corrupted, and we skipped too many
    or too few bytes to find the next one.  In this case,
    damage may have been done, and either end of the 
    implementation detecting this should attempt to warn
    the user through its normal management interfaces.  
    
    2. A magic number added to the header could help detect
    framing problems.  This makes error checking on iSCSI
    headers much stronger.  Again, however, the likelihood
    is that the damage was actually to the previous iSCSI
    header, and other damage may have already been done
    as a result.  Again, the implementation should warn
    like crazy.
    
    3. If many of these errors are seen in a real customer
    environment, or if the application absolutely cannot
    tolerate a chance of error, then TCP stream resync
    is definitely too weak anyway.  In this case, the
    implementation should provide a data integrity check,
    such as a CRC, on iSCSI headers, commands, and data,
    where CRC errors on these will also cause a connection
    recovery sequence.
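    
    As a rough sketch of how the three checks above could fit
    together on the receive side, here is some Python.
    Everything in it is an assumption made for illustration:
    the 12-byte header layout, the MAGIC value, the MAX_LENGTH
    bound, and the frame/parse/receive names are hypothetical
    and are not taken from any iSCSI draft, and zlib's CRC-32
    merely stands in for whatever integrity check is chosen.
    Any failed check raises one error, which the caller
    answers by warning and dropping the connection so that the
    existing connection recovery can take over.

```python
import struct
import zlib

MAGIC = 0x69534353             # hypothetical magic number
MAX_LENGTH = 1 << 20           # hypothetical upper bound on payload length
PREFIX = struct.Struct(">II")  # hypothetical layout: magic, payload length
CRC = struct.Struct(">I")      # CRC-32 over prefix + payload
HEADER_SIZE = PREFIX.size + CRC.size


class FramingError(Exception):
    """The byte stream can no longer be trusted."""


def frame(payload: bytes) -> bytes:
    """Build one message: prefix, CRC, then payload."""
    prefix = PREFIX.pack(MAGIC, len(payload))
    return prefix + CRC.pack(zlib.crc32(prefix + payload)) + payload


def parse(stream: bytes):
    """Yield payloads, raising FramingError at the first failed check."""
    offset = 0
    while offset < len(stream):
        if len(stream) - offset < HEADER_SIZE:
            raise FramingError(f"truncated header at offset {offset}")
        magic, length = PREFIX.unpack_from(stream, offset)
        (crc,) = CRC.unpack_from(stream, offset + PREFIX.size)
        if magic != MAGIC:        # suggestion 2: magic number
            raise FramingError(f"bad magic at offset {offset}")
        if length > MAX_LENGTH:   # suggestion 1: field out of range
            raise FramingError(f"length {length} out of range")
        start = offset + HEADER_SIZE
        payload = stream[start:start + length]
        if len(payload) != length:
            raise FramingError(f"truncated payload at offset {start}")
        prefix = stream[offset:offset + PREFIX.size]
        if zlib.crc32(prefix + payload) != crc:  # suggestion 3: CRC
            raise FramingError(f"CRC mismatch at offset {offset}")
        yield payload
        offset = start + length


def receive(stream: bytes, warn, close_connection) -> list:
    """Deliver payloads until a check fails, then warn and disconnect."""
    delivered = []
    try:
        for payload in parse(stream):
            delivered.append(payload)
    except FramingError as exc:
        warn(f"iSCSI framing error: {exc}")
        close_connection()  # initiator-based recovery opens a new one
    return delivered
```

    Note that without the CRC step, a corrupted length field
    would only be noticed at the next header, after the
    damaged message had already been delivered, which is the
    "damage may have been done" case described above.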
    
    I believe that these three suggestions, taken together,
    would solve our synchronization problems without extra
    effort (other than connection recovery and data integrity
    check mechanisms, which we were going to do anyway).
    
    Any other thoughts?
    
    
    -- 
    Mark A. Bakke
    NuSpeed, Inc.
    mark.bakke@nuspeed.com
    763.398.1054
    


Last updated: Tue Sep 04 01:08:11 2001