
    Re: iSCSI: error recovery



    
    
    Pierre,
    
    Interesting scenario - but ENTIRELY WRONG.
    A more careful reading of the draft would have solved your problem.
    After a failed connection the two parties (I & T) are supposed to do some
    cleanup.
    In the old draft that was accomplished by having the initiator indicate in
    the new login
    what old connection it is replacing.
    
    In the new draft there is an explicit logout that is required before
    resending unacked commands.
    
    This mechanism was carefully designed to help avoid ghost commands
    appearing at the target.
    
    Nevertheless - as David Black has suggested - you are encouraged to look
    for holes.
    As to whether to publish or not, that is entirely a question of taste.
    I would certainly expect the problems to be real or at least harder to
    crack than this one
    (no pun intended).
    
    Regards,
    Julo
    
    Pierre Labat <pierre_labat@hp.com> on 07/11/2000 02:25:28
    
    Please respond to Pierre Labat <pierre_labat@hp.com>
    
    To:   ips@ece.cmu.edu
    cc:
    Subject:  Re: iSCSI: error recovery
    
    
    
    
    Hello,
    
    
    Some suggestions to simplify/secure the error recovery.
    
    Regards,
    
    Pierre
    
    
    
    
    Using several TCP connections gives an unreliable medium.
    Requests, responses and data can be lost, duplicated or become ghosts
    because TCP connection(s) can drop.
    
    
    Trying to do a recovery can lead to some problems.
    The following scenarios describe some of the problems
    we will have.
    I am sure one can find other ones.
    
    Scenario 1
    ----------
    In this first scenario the recovery is delayed
    unnecessarily: the retry of a command will fail.
    
    Initiator_ExpCmdRN = 1
    Target_ExpCmdRN = 4
    
    1) Cmd 5 and Cmd 6 sent over NIC1 on the way to the target
    
    2) NIC1 fails
    
    3) Initiator, detecting that NIC1 failed, retries Cmd 5 and Cmd 6
       on another NIC and TCP connection
       with their unchanged CmdRN (5 and 6) because 5 and 6 are greater
       than Initiator_ExpCmdRN. (It is the algorithm described in the draft.)

    4) The original Cmd 5 and Cmd 6 (sent from the failed NIC1) enter the
       target. Target_ExpCmdRN is updated to 7. These commands have no
       chance to complete correctly because their TCP connection has been
       dropped on the initiator side.

    5) The retries of the commands enter the target (through another TCP
       connection), but their CmdRNs (5 and 6) are less than Target_ExpCmdRN.
       Hence they are dropped by the target.
    
    6) The retry mechanism fails. The initiator will have to wait for
       the timeout of the commands 5 and 6 to try another recovery.
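
    Scenario 1 can be replayed with a small model of the target. This is
    only a sketch: the Target class and its receive() method are
    hypothetical names, the window check is reduced to "drop anything below
    Target_ExpCmdRN", and the target is assumed to expect CmdRN 5 at the
    moment the original commands arrive.

```python
class Target:
    """Hypothetical, simplified target: tracks only the expected CmdRN."""

    def __init__(self, exp_cmd_rn):
        self.exp_cmd_rn = exp_cmd_rn  # next CmdRN the target expects
        self.log = []                 # (outcome, CmdRN, connection)

    def receive(self, cmd_rn, connection):
        if cmd_rn < self.exp_cmd_rn:
            # Already seen this number: the PDU is dropped.
            self.log.append(("dropped", cmd_rn, connection))
            return False
        if cmd_rn == self.exp_cmd_rn:
            self.exp_cmd_rn += 1
        # Numbers ahead of the expected one are kept without advancing
        # the counter (a simplification of real reordering).
        self.log.append(("accepted", cmd_rn, connection))
        return True


def scenario_1():
    target = Target(exp_cmd_rn=5)
    # Step 4: the originals sent through the failed NIC1 arrive first.
    for cmd_rn in (5, 6):
        target.receive(cmd_rn, "NIC1")
    # Step 5: the retries arrive on the other connection.
    retries = [target.receive(cmd_rn, "NIC2") for cmd_rn in (5, 6)]
    return target.exp_cmd_rn, retries
```

    With these assumptions scenario_1() leaves Target_ExpCmdRN at 7 and
    both retries are rejected, which is the situation described in step 6.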
    
    
    
    
    Scenario 2
    ----------
    
    Initiator_ExpCmdRN = 1
    Target_ExpCmdRN = 4
    Imagine the session has 4 TCP connections.
    
    1) Initiator sends a command with CmdRN = 7 over the TCP connection 1.
       Commands 5 and 6 are in flight between the initiator and
       the target (on the TCP connection 4 for example).
    
    2) The command 7 is blocked somewhere on the network because of congestion.
    
    3) The TCP connection 1 fails unexpectedly on the initiator side (for
       whatever reason: hardware, software, cable disconnected...) and the
       target can't be notified.
    
    4) The initiator (as specified in the draft) sends a retry with CmdRN
       unchanged (CmdRN=7) on the TCP connection 2.
    
    5) The TCP connection 2 fails unexpectedly on the initiator side (for
       whatever reason: hardware, software, cable disconnected...) and the
       target can't be notified.
    
    6) The initiator (as specified in the draft) sends a retry with CmdRN
       unchanged (CmdRN=7) on the TCP connection 3.
    
    
    7) The target receives the retry from connection 3, then the retry
       from connection 2, then the original command from connection 1.
       In other words, by bad luck, it receives them in the reverse of the
       order in which the initiator sent them. All these retries/commands
       have the same CmdRN (=7) and the same initiator task tag, hence the
       target gets several retries for the same command and has no clue
       how to re-order them.
       When the target receives the second retry (from connection 2) it
       doesn't know what to do with it. If it supersedes the first retry,
       the retry will fail because the completion will be sent on
       connection 2, which has failed on the initiator side. If it doesn't
       supersede, then had the retries arrived in order, the retry would
       have failed too.
    
    
    Scenario 3
    ----------
    
    1) Cmd 1 sent to the target but blocked in TCP connection 1
    
    2) The initiator sends plenty of commands on other TCP connection(s)
       that are OK.
    
    3) TCP connection 1 fails on initiator side
    
    4) Abort of Cmd 1 sent on TCP connection 2. The Abort is non-numbered
       (CmdRN=0).
       The abort is received by the target,
       which returns "function rejected" because there is no
       matching task tag.

    5) At this point the initiator doesn't know what to do, because it
       doesn't know if the command has been lost or if it will arrive
       at the target later.
    
    6) The command 1 finally reaches the target (ghost IO), and is not aborted.
    
    
    Scenario 4
    ----------
    In this scenario, the whole traffic of a session is blocked
    when one command fails.
    
    1) Cmd 10 sent to the target but blocked in TCP connection 1
       and will never reach the target.
    
    2) The initiator sends plenty of commands on other TCP connection(s)
       that are OK.
    
    3) TCP connection 1 fails on initiator side
    
    4) Abort of Cmd 10 sent on TCP connection 2. The Abort is numbered
       using a new CmdRN.
       The abort is received by the target but not processed because
       the CmdRN of the abort is greater than Target_ExpCmdRN,
       which is blocked on 10.

    5) The entire command processing (through all TCP connections) is blocked
       on the target at Target_ExpCmdRN = 10
       until SCSI retries the command 10 with the same CmdRN
       (that can take several seconds). And if SCSI doesn't
       retry with the same CmdRN (10) we have a deadlock.
    
    Scenario 5
    ----------
    Initiator_ExpCmdRN=Target_ExpCmdRN=5
    Initiator_MaxCmdRN=Target_MaxCmdRN=100
    Two TCP connections are used.
    
    1) The initiator sends the command CmdRN=5 over the connection 1
       then the commands CmdRN=6 to CmdRN=100 over the connection 2
    
    2) The initiator can send no more commands because
       current CmdRN = MaxCmdRN
    
    3) The TCP connection 1 breaks on the initiator side and the
       command 5 will never reach the target.
    
    4) The initiator wants to do a recovery with numbered commands
       (abort task for example), but can't send it because CmdRN = MaxCmdRN.
    
    5) The target doesn't want to increment MaxCmdRN because it has already
       buffered commands up to 100 and has no extra buffer space. It waits
       to receive command 5. It could be because it allocated a maximum
       amount of memory space for the non-ordered commands it receives.

    6) The initiator waits for MaxCmdRN to increase and the target waits for
       command 5 to come or be aborted. We have a deadlock.
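
    The two waiting conditions of scenario 5 can be written down as a
    short check. This is a sketch under assumptions: the function names are
    hypothetical, the window is taken as inclusive of MaxCmdRN, and the
    target is assumed to make progress only once it holds the command
    numbered Target_ExpCmdRN.

```python
def initiator_can_send_numbered(next_cmd_rn, max_cmd_rn):
    # A numbered PDU may only be sent while its CmdRN fits in the window.
    return next_cmd_rn <= max_cmd_rn


def target_can_advance(exp_cmd_rn, received):
    # The target can only move on once the expected command has arrived.
    return exp_cmd_rn in received


def is_deadlocked(next_cmd_rn, max_cmd_rn, exp_cmd_rn, received):
    # Each side is waiting for something only the other side can provide.
    return (not initiator_can_send_numbered(next_cmd_rn, max_cmd_rn)
            and not target_can_advance(exp_cmd_rn, received))


# Scenario 5: commands 6..100 arrived, command 5 is lost, and the next
# numbered PDU (the abort) would need CmdRN 101.
stuck = is_deadlocked(next_cmd_rn=101, max_cmd_rn=100,
                      exp_cmd_rn=5, received=set(range(6, 101)))
```

    Here stuck is True. It becomes False as soon as command 5 is in the
    received set, or as soon as the recovery PDU no longer needs a CmdRN
    inside the window.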
    
    
    Scenario 6
    ----------
    
    1) the initiator sends the command CmdRN=1 on
       a TCP connection
    
    2) the command is stuck in the network
    
    3) The command times out on the initiator

    4) the initiator "retries" the command on the same
       TCP connection and the retried command is in the network
    
    5) the target receives the original command, executes it,
       and sends the completion.
    
    6) the initiator receives the completion, it doesn't know
       if it is from the original command or from the
       "retry" command because the same initiator task tag is used
       in both commands
    
    
    Solving these problems
    ======================
    To get rid of all these corner cases and have a basic, simple
    and robust recovery mechanism that avoids or manages
    lost, duplicated or ghost commands, we could:
    
    - keep the fact that every numbered command with a CmdRN out
      of the window [Target_ExpCmdRN,Target_MaxCmdRN]
      is discarded silently.
    
    - always recover commands by doing an abort, then
      sending the command again with a new CmdRN
      and a new initiator task tag.

    - slightly modify the abort (send it non numbered)
      and change a little bit the way non numbered messages are coded.
    
    Below are listed the modifications:
    
    Modification of the coding of the headers
    -----------------------------------------
    for non numbered commands:
    --------------------------
    
    Add a bit in the iSCSI header to indicate
    whether the transaction is numbered or not. This allows the CmdRN
    field (when the command is non numbered) to be used
    to reference the command the transaction is targeted at.
    Currently, to indicate that a command is non numbered, CmdRN
    must be set to 0.
    When the non numbered bit is set, the target doesn't
    discard the request if CmdRN is out of the window
    [Target_ExpCmdRN,Target_MaxCmdRN].
    CmdRN indicates the command the non numbered
    transaction is targeted at. If the non numbered
    transaction is not targeted at any specific command,
    CmdRN is set to 0.
    This makes the Abort more robust (see below).
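
    A minimal sketch of how a target might interpret the proposed bit. The
    Header type and both function names are hypothetical and only model
    the two fields discussed here, not the real iSCSI header layout.

```python
from dataclasses import dataclass


@dataclass
class Header:
    non_numbered: bool  # the proposed new bit
    cmd_rn: int         # sequence number, or a reference if non_numbered


def target_accepts(header, exp_cmd_rn, max_cmd_rn):
    """Window check: only numbered PDUs can be discarded by it."""
    if header.non_numbered:
        return True
    return exp_cmd_rn <= header.cmd_rn <= max_cmd_rn


def referenced_command(header):
    """CmdRN the non numbered PDU is targeted at, or None if there is none."""
    if header.non_numbered and header.cmd_rn != 0:
        return header.cmd_rn
    return None
```

    For example, with a window of [4,10], Header(False, 3) is discarded,
    while Header(True, 3) is accepted and references command 3.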
    
    
    Modification of Abort task:
    ---------------------------
    The abort is sent non numbered (with the non numbered bit set).
    The CmdRN is set to the value corresponding
    to the command to abort.
    
    When the target receives an abort:
    
    - If there is no task associated with CmdRN and
      CmdRN is out of the window
      [Target_ExpCmdRN,Target_MaxCmdRN],
      the abort returns immediately with success.

    - If there is no task associated with CmdRN but
      CmdRN is in the window [Target_ExpCmdRN,Target_MaxCmdRN],
      the target marks CmdRN as "jump". It means that
      when Target_ExpCmdRN reaches CmdRN, it will simply
      jump to CmdRN+1. This prevents a deadlock if the command
      to abort never comes to the target.

    - If there is a task associated with CmdRN,
      the target aborts the task or cleans up the resources
      if the task was not yet in a task set, marks CmdRN as "jump",
      and returns successfully.
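
    The three cases above can be sketched as follows. The AbortTarget
    class is hypothetical, tasks are reduced to a dictionary keyed by
    CmdRN, and returning "success" in the second case is an assumption
    (the text does not state the return value for it).

```python
class AbortTarget:
    """Hypothetical target that handles the proposed non numbered abort."""

    def __init__(self, exp_cmd_rn, max_cmd_rn):
        self.exp_cmd_rn = exp_cmd_rn
        self.max_cmd_rn = max_cmd_rn
        self.tasks = {}    # CmdRN -> task description
        self.jump = set()  # CmdRNs that Target_ExpCmdRN must skip

    def in_window(self, cmd_rn):
        return self.exp_cmd_rn <= cmd_rn <= self.max_cmd_rn

    def abort(self, cmd_rn):
        if cmd_rn in self.tasks:
            # Case 3: abort the task (or free its resources), mark "jump".
            del self.tasks[cmd_rn]
            self.jump.add(cmd_rn)
        elif self.in_window(cmd_rn):
            # Case 2: the command may still arrive or never arrive.
            self.jump.add(cmd_rn)
        # Case 1 (no task, out of window): nothing to clean up.
        self._advance()
        return "success"

    def _advance(self):
        # Step over every hole that has been marked "jump".
        while self.exp_cmd_rn in self.jump:
            self.jump.discard(self.exp_cmd_rn)
            self.exp_cmd_rn += 1
```

    Applied to scenario 4, AbortTarget(10, 100).abort(10) moves
    Target_ExpCmdRN from 10 to 11 even though command 10 never arrived.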
    
    
    
    The recovery mechanism "retrying" the commands
    ==============================================
    
    Besides the basic abort/new command recovery,
    the more sophisticated "retry" may be faster.
    
    The initiator (instead of doing an abort and sending again
    the command with a new initiator task tag and a new CmdRN)
    can send a "retry" message.
    
    To avoid the problems described in the scenarios, the "retry"
    message must be more sophisticated than simply setting the
    retry bit as specified in the draft.
    It must combine a part of the job of an "abort task"
    (to fill the holes in the CmdRN sequence to allow
    Target_ExpCmdRN to make progress) and the job of sending
    again the command.
    
    Modification of the "retry"
    ---------------------------
    This "retry" message has the format of the SCSI command pdu
    except:
    - a "referenced initiator task tag" field is added. It
      references the command to "retry"
    - a "timestamp" field (integer) is added.
    
    
    When the initiator sends a "retry" it:
    
    - sets the retry bit and the non numbered bit
    - updates the CmdRN field with the value of the CmdRN of
      the command to retry
    - generates a new initiator task tag(not the
      one of the task to retry)
    - updates the "referenced initiator task tag" with
      the one of the command to retry.
    - sets the timestamp to 0.
    
    For the following "retry(s)" of the same command
    (in the case the first one failed) the initiator
    generates a new initiator task tag and increments
    the timestamp.
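
    On the initiator side, the rules above could look like this. It is
    only a sketch: RetryPdu and RetryBuilder are hypothetical names, the
    PDU carries just the fields named in this proposal, and new task tags
    come from a simple counter assumed not to collide with tags in use.

```python
import itertools
from dataclasses import dataclass


@dataclass
class RetryPdu:
    retry: bool
    non_numbered: bool
    cmd_rn: int               # CmdRN of the command being retried
    task_tag: int             # fresh tag for this retry
    referenced_task_tag: int  # tag of the command being retried
    timestamp: int


class RetryBuilder:
    def __init__(self, first_free_tag):
        self._tags = itertools.count(first_free_tag)
        self._next_timestamp = {}  # original task tag -> next timestamp

    def build(self, cmd_rn, original_task_tag):
        # The first retry gets timestamp 0, each later one is incremented.
        timestamp = self._next_timestamp.get(original_task_tag, 0)
        self._next_timestamp[original_task_tag] = timestamp + 1
        return RetryPdu(retry=True, non_numbered=True, cmd_rn=cmd_rn,
                        task_tag=next(self._tags),
                        referenced_task_tag=original_task_tag,
                        timestamp=timestamp)
```

    Two successive calls to build(7, 42) on the same builder give PDUs
    with timestamps 0 and 1, different task tags, and the same referenced
    task tag 42.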
    
    The target, when receiving a retry:
       - checks if there is a task already associated with CmdRN.
       - if NO (the command has been lost or will come later (ghost))
         the target acts as if it was receiving the original
         command. It records the timestamp.
       - if YES the target checks the timestamp. If the one
         in the retry is older than the one in the target, the
         "retry" is discarded silently. If the timestamp in the "retry"
         is newer than the one in the target associated with the command,
         the current task is stopped and restarted, and the new
         timestamp is recorded by the target.
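
    The target side of the retry can be sketched in the same way. The
    RetryTarget class is hypothetical, and discarding a retry whose
    timestamp equals the recorded one is an assumption, since only the
    "older" and "newer" cases are specified above.

```python
class RetryTarget:
    """Hypothetical target applying the timestamp rule to retries."""

    def __init__(self):
        # CmdRN -> (task tag of the PDU being served, recorded timestamp)
        self.tasks = {}

    def on_retry(self, cmd_rn, task_tag, timestamp):
        current = self.tasks.get(cmd_rn)
        if current is None:
            # No task yet: treat the retry as the original command.
            self.tasks[cmd_rn] = (task_tag, timestamp)
            return "started"
        _, recorded = current
        if timestamp > recorded:
            # Newer retry: stop the current task and restart it.
            self.tasks[cmd_rn] = (task_tag, timestamp)
            return "restarted"
        # Older (or, by assumption, equal) timestamp: a ghost, drop it.
        return "discarded"
```

    If the retries of scenario 2 arrive in reverse order (timestamp 1
    first, then timestamp 0), the first one is started and the second one
    is discarded, so the completion goes to the most recent retry.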
    
    Sending the retry non numbered allows the "retry" to reach
    the target even if the command window is closed. That can
    prevent the kind of deadlock described in scenario 5.
    That solves scenario 1 too.
    
    In the case the first retry doesn't work and the
    initiator needs to send another one (for the same command),
    sending the retries with different "initiator task tags"
    allows the initiator to establish the correspondence between
    the retry PDUs and their completions.
    In general, as the main goal of the initiator task tag
    is to allow the initiator to establish the correspondence
    between the requests and the responses, it is cleaner
    for each initiator request to generate a new initiator
    task tag.
    
    Having a timestamp avoids the problems described
    in scenario 2. The target knows how to distinguish
    between new PDUs and the ghost ones.

    Using the CmdRN to reference the command to retry
    allows the target to fill the holes in the CmdRN sequence,
    even if the original command never reached the target,
    so that Target_ExpCmdRN can make progress.
    
    
    An initiator must not send a "retry" if it acknowledged
    the Status of the corresponding command.
    The target can forget the CmdRN of a command as soon as
    the corresponding status has been acknowledged.
    If the target receives a retry with a CmdRN
    that is not in the window [Target_ExpCmdRN,MaxCmdRN]
    and that doesn't correspond to any task whose status
    has not yet been acknowledged by the initiator,
    the target answers with an iSCSI status of the kind
    "out of range".
    
    It seems to me that these three modifications (non numbered command,
    abort task, retry) allow a robust recovery,
    eliminating the problems generated by duplicate,
    ghost and missing iSCSI PDUs. The target always knows exactly what to do,
    it is specified, and the target is never blocked.
    
    
    
    
    The StatRN is useful only if "retry" is used.
    
    
    
    
    
    
    
    
    

