
    Proposed Connection Recovery Additions for Draft 03



    
    Draft 03 seems to be pretty close as far as dealing with
    connection recovery.  Here are a few additions that we
    (NuSpeed) think will help complete the picture, along with
    what we believe are some of the requirements.
    
    Other than hopefully clarifying initiator and target behavior,
    this scheme adds one field to the Command Request, and one event
    type to the Asynchronous Event message.
    
    I've attempted to include some of our reasoning behind this.
    
    Assumptions
    
    - iSCSI is only a transport for SCSI.  Its recovery scheme
      does not attempt to retry failed commands.  However, it
      is a reliable transport for SCSI, and must deliver commands
      within a session in order to the target.
      
    - TCP handles any losses from a given connection.  The target
      end of the byte stream is either valid, or the connection is
      lost.  Within a connection, there is no such thing as losing
      a packet.  We will, however, need to deal with stronger error
      checking over the SCSI data, but that's (mostly) orthogonal
      to connection recovery.
    
    - This scheme will work with either single or multiple
      connections per session.  It is up to the implementation
      of the initiator and target whether either one supports
      multiple connections.  The initiator can simply not use
      more than one connection per session; the target can deny
      the login if the client requests more than one connection
      per session.
    
    - It is also up to the initiator to determine whether to
      multiplex access to targets and LUNs over a single session,
      or to use a session for each, or some combination.  If the
      initiator chooses to use multiple sessions to the same
      device, it must be prepared to deal with multipath command
      ordering issues itself.
    
    - We can be fairly optimistic about the longevity of TCP
      connections; if the network is so slow, overloaded, or
      poorly designed as to lose connections regularly, it is
      not likely a good candidate for storage access.  Connection
      recovery should still handle these cases, especially if
      the problems are transient, but need not be optimized for
      these cases.  If a connection fails, it should be acceptable
      to re-send write data with the re-sent command, and to re-send
      read data with the re-sent status.  There may, however, be
      simple optimizations to avoid this, too, especially when
      transporting larger blocks, such as tape reads and writes.
    
    - Some commands, such as FORMAT UNIT and REWIND, may take several
      minutes or more to complete.  Thousands of operations may complete
      before status is returned for these commands.  (Are 16-bit
      reference numbers enough?)
    
    - If multiple connections are used, they are symmetrical (no special
      control or data connections).  Command-Data-Status connection
      allegiance is also assumed, and CmdRN and StatRN are used to
      ensure in-order delivery.
    
    - CmdRN and StatRN are implemented as in Draft 03, as 16-bit,
      per-session incrementing counters.
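
    To make the wrap behavior of 16-bit counters concrete, here is a
    minimal Python sketch of wrap-aware reference number arithmetic.
    The helper names (rn_next, rn_distance, rn_before) and the
    half-range comparison rule are assumptions made for illustration;
    they are not taken from Draft 03.

```python
RN_BITS = 16                 # CmdRN/StatRN width assumed in this post
RN_MOD = 1 << RN_BITS        # 65536 distinct values
RN_HALF = RN_MOD >> 1        # 32768, the comparison window (assumption)


def rn_next(rn: int) -> int:
    """Increment a reference number, wrapping from 65535 back to 0."""
    return (rn + 1) % RN_MOD


def rn_distance(a: int, b: int) -> int:
    """Forward distance from a to b, honoring wrap (result is 0..65535)."""
    return (b - a) % RN_MOD


def rn_before(a: int, b: int) -> bool:
    """True if a precedes b, assuming the two are less than 32768 apart."""
    return a != b and rn_distance(a, b) < RN_HALF
```

    Under this comparison rule, ordering is only unambiguous while
    fewer than 32768 reference numbers separate the oldest outstanding
    command from the newest, which bears on the question above about
    whether 16 bits are enough.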
    
    
    Requirements for Connection Recovery
    
    - Protocol fields associated with the connection recovery
      scheme will work with either a single connection per
      session, or multiple connections per session.
    
    - iSCSI must preserve ordered delivery within a session.
    
    - The transport may re-send commands, data, and status at
      any time, but must not attempt to re-try the actual command
      at the target without involving the upper (SCSI) layer for
      recovery.  This means that, as in section 4.1 of draft 03,
      the client should keep sufficient information handy to re-send
      commands and data until status is received.
    
    - We can generally make an exception to the above for commands
      issued to a block device (disk); reads and writes are idempotent,
      as long as the commands are re-issued in the original order.
    
    - Commands must be issued at the target end of an iSCSI session
      in-order, but status may, of course, be returned from the iSCSI
      target to the initiator in any order.
    
    - Either the initiator or target may decide to terminate a
      connection.  It is the responsibility of the initiator to
      reconnect if it so chooses.
    
    - A connection must be recoverable quickly.  The whole sequence
      (the connection fails, is detected as failed, is restarted, has
      its commands reissued, and gets status back, except on
      high-latency commands) must fit within a portion of the normal
      SCSI timeout window (30 seconds).
      The actual time for this depends on the network, the commands
      issued, etc.  At any rate, connection recovery must be as 
      transparent as possible to the end user or application.
    
    - Connection recovery should work for target reboot or failover.
    
    - Basically, we have to handle the following steps for each
      connection:
    
      1. Detection - deciding when a connection is down, or should be.
      2. Disconnection - terminating a connection.
      3. Reconnection - re-connecting to the target.
      4. Resend - re-sending commands to the target.
    
      Besides these procedures, normal mechanisms such as reference
      numbers and response caching will be in place to support these
      procedures when they are needed.
    
    Support Mechanisms
    
      The initiator and target must keep some state around in order to
      support connection recovery and resending of commands, data, and
      status that may have been lost.  Their responsibilities are
      outlined in section 4.1 of Draft 03.
    
      Basically, an Initiator must:
      
      - Increment CmdRN for each new command request sent.
    
      - Keep information required to rebuild and resend each command
        with its data until the matching command response is received
        from the target.
    
      - Acknowledge command responses soon after they are received
        from the target.
    
      A Target must:
    
      - Increment StatRN for each new status response sent.
    
      - Keep a cache of responses (status & sense data) until the
        StatRN is acknowledged by the initiator.
    
      - For non-disk devices, keep data response (read data) along
        with the cached command response (although this might be
        difficult with large-block devices).
    
      Reclaiming Cached Responses - section 4.1 already mentioned most
      of the above; however, there was no mechanism for notifying the
      target that its cached responses were no longer needed.  In this
      scheme, an AckStatRN is sent from the initiator to the target,
      as the highest (honoring wrap) consecutive value received for
      StatRN in a response on any connection in the session.  All
      cached responses up to and including this StatRN value may be
      safely de-allocated.
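
      The cumulative acknowledgement described above can be sketched
      as follows.  This is only an illustration under stated
      assumptions: the class and method names are hypothetical, the
      starting StatRN is passed in by the caller, and stale or
      duplicate values are detected with a half-range (32768) window,
      which is not something Draft 03 specifies.

```python
RN_MOD = 1 << 16          # 16-bit StatRN space
RN_HALF = RN_MOD >> 1     # window used to spot stale values (assumption)


class InitiatorAckTracker:
    """Tracks the highest consecutive StatRN seen on any connection."""

    def __init__(self, first_expected: int):
        self.next_expected = first_expected % RN_MOD
        self.ahead = set()          # StatRNs received beyond a gap

    def on_response(self, stat_rn: int) -> None:
        stat_rn %= RN_MOD
        # A value "behind" next_expected is a duplicate of something
        # already counted (for example a resent response); ignore it.
        if (stat_rn - self.next_expected) % RN_MOD >= RN_HALF:
            return
        self.ahead.add(stat_rn)
        # Advance over every StatRN that has become consecutive.
        while self.next_expected in self.ahead:
            self.ahead.remove(self.next_expected)
            self.next_expected = (self.next_expected + 1) % RN_MOD

    def ack_stat_rn(self) -> int:
        """Value to carry as AckStatRN in the next Command Request."""
        return (self.next_expected - 1) % RN_MOD


class TargetResponseCache:
    """Holds status and sense data until the initiator acknowledges it."""

    def __init__(self, first_stat_rn: int):
        self.oldest = first_stat_rn % RN_MOD   # oldest unacknowledged StatRN
        self.cache = {}

    def store(self, stat_rn: int, response) -> None:
        self.cache[stat_rn % RN_MOD] = response

    def on_ack(self, ack_stat_rn: int) -> None:
        # Number of StatRNs newly covered by this cumulative ack.
        count = (ack_stat_rn - self.oldest + 1) % RN_MOD
        if count >= RN_HALF:
            return                  # stale ack, nothing new to free
        for _ in range(count):
            self.cache.pop(self.oldest, None)
            self.oldest = (self.oldest + 1) % RN_MOD
```

      Because the ack is cumulative, a lost AckStatRN costs nothing:
      the next Command Request carries a value that covers it.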
    
    Detecting Connection Failure
    
      During an initiator, target, or intervening network outage, whether
      temporary or permanent, TCP connections will normally be retried
      for much longer than most SCSI drivers can handle.  In many cases,
      new connections can be made and started long before the old
      connection times out.  For this reason, we have to detect connections
      that have gone away.  Both the initiator and the target may detect
      these conditions, and should detect them in a timely manner (let's
      say 5 seconds for now, but we need to think about this).
    
      From the initiator's point of view, the connection can fail for
      several reasons (temporary or permanent):
    
      - Target powered down or removed from network
      - Target reboot or failover
      - Lost network route
      - Backed-off (slow) TCP connection
      - Unexpected message fields received (software error on target)?
    
      If no responses are being received from the target, and there are
      outstanding commands, the initiator will periodically send a ping
      request, and expect a ping response within a small amount of time.
      If no ping response is received, the connection is considered
      to have failed.  This is mentioned in section 4.1 as well.
    
      From the target's point of view, the connection can fail for
      several reasons (temporary or permanent):
    
      - Initiator powered down or removed from network
      - Initiator reboot or failover
      - Lost network route
      - Backed-off (slow) TCP connection
      - Unexpected message fields received (software error on initiator)?
    
      Since the target does not send requests, it could do one of two
      things:
    
      1. During the login phase, negotiate a maximum inactivity time 
         for the incoming target connection.  If this time will be
         exceeded, the client promises to send an iSCSI ping request
         on the connection to keep it alive.  If the inactivity timer
         expires on the target, the connection is assumed to have failed.
    
      2. Add an asynchronous event requesting that the initiator ping
         the target.  Send this when approaching the target's
         maximum inactivity time; if the timer expires anyway, the
         connection is assumed to have failed.
    
      In any case, the target must detect connection failure to avoid
      having connections from powered-down clients hang around for
      long periods of time.
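
      The two detection timers described in this section might look
      roughly like this.  It is a sketch only: the class names, the
      caller-supplied clock values, and the split of the tentative
      5-second budget into 2 seconds idle plus 3 seconds waiting for
      the ping response are all assumptions, and the target side shows
      only option 1 (a negotiated maximum inactivity time).

```python
class InitiatorLiveness:
    """Decides when the initiator should ping, and when to give up."""

    def __init__(self, now: float, idle_before_ping: float = 2.0,
                 ping_timeout: float = 3.0):
        self.idle_before_ping = idle_before_ping   # assumed value
        self.ping_timeout = ping_timeout           # assumed value
        self.last_rx = now
        self.ping_sent_at = None

    def on_receive(self, now: float) -> None:
        # Any message from the target, including a ping response,
        # proves the connection is alive.
        self.last_rx = now
        self.ping_sent_at = None

    def poll(self, now: float, outstanding_commands: int) -> str:
        """Return "ok", "send_ping", or "failed"."""
        if self.ping_sent_at is not None:
            if now - self.ping_sent_at >= self.ping_timeout:
                return "failed"
            return "ok"
        idle = now - self.last_rx
        if outstanding_commands > 0 and idle >= self.idle_before_ping:
            self.ping_sent_at = now
            return "send_ping"
        return "ok"


class TargetInactivityTimer:
    """Option 1: the target fails a connection that stays silent too long."""

    def __init__(self, now: float, max_inactivity: float):
        self.max_inactivity = max_inactivity   # negotiated at login
        self.last_rx = now

    def on_receive(self, now: float) -> None:
        self.last_rx = now

    def expired(self, now: float) -> bool:
        return now - self.last_rx >= self.max_inactivity
```

      The caller passes in the current time, so the same logic can be
      driven by a real clock or by a test harness.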
    
    
    Disconnecting
    
      When a connection fails, the initiator, target, or both will
      close it.  The initiator can generally not wait around for the
      close to complete before starting a new connection; the target
      will need to accept a new (recovered) connection from an
      initiator, even if the target has not realized the original
      connection's failure.  These are implementation issues.
    
      The initiator may disconnect for reasons other than failure:
    
      - Normal host shutdown (reboot or power off)
      - Application (and disk) failover to another host (e.g. using
        HP, Veritas, or other application failover software).
    
      The target may also disconnect for reasons other than failure:
      
      - If the target is to be rebooted or failed over to another
        physical unit, it may wish to gracefully shut down the connection
        before restarting another.
    
      To make target reboot or failover more graceful, a target should
      attempt to send an asynchronous event "connection shutdown", to
      the initiator on each connection.  This new event contains two
      values:
    
      - MaxUpTime - the number of seconds (can be zero) before this
        connection is expected to cease functioning.  The initiator
        should not attempt to issue more commands than can be expected
        to complete and receive status within this amount of time.
        The target will wait this amount of time before it shuts
        its connections down.
    
      - MinHoldTime - the number of seconds (can also be zero) after
        MaxUpTime before this entity will be available for re-connection.
        After this, the initiator has a good chance of reconnecting
        to the target.  This should be set to the amount of time the
        server is expected to take to fail over, reboot, etc.  We should
        probably define a value (-1?) for "never".  Note that an
        initiator could just reconnect right away; however, it could
        either connect to the running server just before it reboots, 
        or it could lose several SYN segments while waiting for the
        server, causing exponential backoff to make the ultimate
        connection take longer.
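
      One way an initiator might act on the two values is sketched
      below.  The type and function names are hypothetical, times are
      plain seconds supplied by the caller, and -1 is used for "never"
      only because it is the tentative value floated above.

```python
from dataclasses import dataclass
from typing import Optional

NEVER = -1   # tentative "never" value for MinHoldTime (not yet defined)


@dataclass
class ConnectionShutdown:
    max_up_time: int     # seconds until the connection stops working
    min_hold_time: int   # seconds after that before reconnecting, or NEVER


def may_issue(event: ConnectionShutdown, received_at: float,
              now: float, expected_duration: float) -> bool:
    """True if a command issued now should get status before shutdown."""
    deadline = received_at + event.max_up_time
    return now + expected_duration <= deadline


def earliest_reconnect(event: ConnectionShutdown,
                       received_at: float) -> Optional[float]:
    """Time at which reconnecting becomes worthwhile, or None for never."""
    if event.min_hold_time == NEVER:
        return None
    return received_at + event.max_up_time + event.min_hold_time
```

      For example, with MaxUpTime = 10 and MinHoldTime = 30 received
      at time 100, a command expected to take 4 seconds may still be
      issued at time 105, and the initiator would wait until time 140
      before reconnecting.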
    
    Reconnection
    
      The initiator always handles reconnection.  During the new
      connection's login phase, the initiator specifies that it is
      replacing a failed connection by including the non-zero CID
      of the old connection in the RecoverCID field.
    
      If a target supports stateful recovery (meaning it still has
      the cached responses for the session), it accepts the login.
    
      If the target does not support stateful recovery, or the
      target has rebooted and lost its state, or the target has
      dropped the cached responses due to an excessive amount of
      time passing (perhaps 60 seconds), it rejects the login with
      a "reject recovery" status.  The initiator then performs
      a new login, and does stateless recovery.
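
      The target's accept/reject decision for a recovery login could
      be sketched like this.  The function name, its parameters, and
      the treatment of RecoverCID = 0 as an ordinary login are
      assumptions; the 60-second limit is the tentative figure
      mentioned above.

```python
from typing import Optional

ACCEPT = "accept"
REJECT_RECOVERY = "reject recovery"


def recovery_login_status(recover_cid: int,
                          supports_stateful: bool,
                          last_activity: Optional[float],
                          now: float,
                          max_cache_age: float = 60.0) -> str:
    """Decide how the target answers a login that may carry RecoverCID.

    last_activity is the time the session state was last known good,
    or None if the target has no state for it (for example after a
    reboot).
    """
    if recover_cid == 0:
        return ACCEPT              # assumed: zero means no recovery asked
    if not supports_stateful:
        return REJECT_RECOVERY     # target never caches responses
    if last_activity is None:
        return REJECT_RECOVERY     # state lost, e.g. target rebooted
    if now - last_activity > max_cache_age:
        return REJECT_RECOVERY     # cached responses already dropped
    return ACCEPT
```

      On "reject recovery" the initiator would log in again without
      RecoverCID and fall back to stateless recovery.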
    
    1. Stateful Recovery
    
      In a stateful recovery, the initiator resends all commands for
      which it has not received status.  If a command has already
      completed, the cached response is returned.  If a command has
      already been issued and is in progress, it is not re-issued,
      and will just be queued somewhere to wait for status.  If
      a command had not been received by the target (or incompletely
      received and thrown away), it will be issued as normal.
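
      The three cases above reduce to a small dispatch on the target.
      In this sketch the names are hypothetical, and keying the lookup
      by CmdRN is an assumption (an implementation might key on the
      initiator task tag instead).

```python
from enum import Enum


class Action(Enum):
    RETURN_CACHED = "return cached response"
    WAIT_FOR_STATUS = "queue and wait for status"
    EXECUTE = "issue as normal"


def handle_resent_command(cmd_rn, cached_responses, in_progress):
    """Classify a command resent during stateful recovery.

    cached_responses maps CmdRN to the cached status for completed
    commands; in_progress is the set of CmdRNs still executing.
    Returns (action, cached_response_or_None).
    """
    if cmd_rn in cached_responses:
        # Already completed: replay the status, do not touch the device.
        return Action.RETURN_CACHED, cached_responses[cmd_rn]
    if cmd_rn in in_progress:
        # Already issued: do not re-issue, status will follow later.
        return Action.WAIT_FOR_STATUS, None
    # Never received, or received incompletely and discarded.
    return Action.EXECUTE, None
```

      The point of the ordering is that the device never sees the same
      command twice; only the EXECUTE case reaches it.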
    
    2. Stateless Recovery
    
      By default, stateless recovery means that all outstanding
      commands are terminated (to the SCSI layer) (check condition?);
      higher layers must perform recovery.
    
    
    Non-Recovery
    
      Let's face it; there are times when things just can't be recovered
      at this level.  However, there are many higher-level entities that
      may recover for us:
    
      - Tape backup software (reload into a different tape drive)
      - Volume managers (break mirrors)
      - Multipath SCSI drivers (find alternate path or controller)
      - Host application clusters (move app to host with connectivity)
    
      This should be handled as specified in section 4.3.
    
    
    Optimizations:
    
    
    1. If RTT is in use, and a write request is re-sent to a target, and
       the target has already written the data, the target could send the
       Command Response back instead of the RTT.  The initiator would just
       accept this as the final status, and would not have to send the
       write data again.
    
    
    iSCSI Draft 03 Message Modifications:
    
    1. Remove StatRN from the Data Response, or make it equal the StatRN
       for the matching Command Response.  There should be no need for it
       to increment separately from the Command Response, since this scheme
       assumes that if the response was not received, the data will be
       re-sent anyway.  The current draft does not specify how StatRN
       is used in a Data Response.
    
    2. Add an AckStatRN field to Command Request, to acknowledge the
       highest (honoring wrap) consecutive StatRN received for the
       session.
    
    3. Add a new event (Event Indicator 5) specifying that the connection
       will be closed by the target.  This event sends two parameters
       (using some of the reserved fields):
    
       - MaxUpTime - the number of seconds the target intends to keep
         the connection alive.
       - MinHoldTime - the number of seconds the initiator should wait
         before establishing a new connection.
    
    
    
    Alternative Implementations
    
    1. We considered a separate message to send the AckStatRN, but
       since this is generally done for every command, it seemed simpler
       to just piggyback it on the next command request.
    
    2. CmdRN and StatRN are assumed to be per-session.  If they were
       made per-LUN for any reason, the initiator and target would
       simply have to demux requests and responses based on LUN + RN.
    
    
    A Few Alternatives that Didn't Quite Work
    
       Here are some alternatives we went through, and why we did not 
       choose them:
    
    1. We considered just tossing out StatRN, and acking the CmdRN
       (or the ITT) matching the status last received instead.  However,
       the target needed to keep track of status in the order in which
       it was sent (and NOT in the order in which the original command
       was received) to avoid trouble with commands which incur a long
       response delay (REWIND et al).  Keeping StatRN and acking it
       is much simpler, and makes it easier to preserve response
       ordering if multiple connections are used (and if response
       ordering is required).
    
    2. Target Retries - if one just assumes disk, the target could avoid
       caching responses, and the initiator could avoid acking them; the
       target could just retry any requests, in order, sent over the
       new connection.  However, this excludes most of the SCSI Peripheral
       Device Types, and will likely not work in every case for disk
       either.  In these cases, ALL connection recovery would be pushed
       up to SCSI to handle.  By caching status, we remain a truer
       transport, and will work better for these devices (especially tape).
    
    3. Selective StatRN acks - one could individually acknowledge each
       response, to free its resources on the target.  However, if a
       selective ack is lost during a connection recovery, its resources
       would then hang around forever on the target (unless, of course,
       we wanted to ack the ack).  Cumulative acks may be lost at the
       end of a connection; the next command sent will just re-ack
       everything anyway.
    
    
    -- 
    Mark A. Bakke
    NuSpeed, Inc.
    mark.bakke@nuspeed.com
    763.398.1054
    


Last updated: Tue Sep 04 01:08:12 2001