
    RE: iSCSI: error recovery



    Matt,
    
    > There has been a lot of discussion on how the Status reference numbers are
    > (can be) used for error detection and recovery.  There is even a
    > (optional)
    > method for numbering the Data PDUs now.
    >
    > Let's clarify what "errors" we are trying to recover from, and
    > how the RNs are
    > meant to be used.  The example is as follows.  An iSCSI session
    > has multiple
    > TCP connections over *separate* physical links.  If one of the
    > physical links
    > fails, it is desirable to "recover" the SCSI I/Os that were
    > occurring on the
    > TCP connection(s) that were established over that link.  We should *not*
    > attempt to recover "errors" that are caused due to data being
    > discarded after
    > it has been delivered from TCP to the upper layers (iSCSI, SCSI,
    > whatever).
    >
    > Now there has already been discussion on how the TCP timeouts are
    > (generally)
    > longer than most SCSI command timeouts, so I'm only discussing link errors
    > that can be detected fairly quickly.  For example, if the
    > physical link gets
    > yanked, the MAC can relatively quickly determine the link is down
    > and notify
    > the appropriate management entity.
    
    For a connection failure not to cause serious disruption at the SCSI
    layer, more than an assumption about the means of notification is
    needed.  That was the reason for emphasizing an ULP-level means of
    detecting a failure quickly enough to prevent transport-failure events.
    
    > The goal is to have a mechanism for the initiator to determine
    > what commands
    > are outstanding on the failed connection. Likewise, it's desirable for the
    > target to retain the data and/or status of I/Os until they are
    > acknowledged by
    > the initiator, so that in the event of a link failure, the target
    > can "replay"
    > the I/O.
    
    Replay would not be the fastest means of recovery.  To take advantage
    of all the numbering of commands, data, and status responses, simply
    sending a NOP upon reconnection would tell the target which of the
    status responses it holds remain undelivered.  Only unconfirmed
    commands would then need to be replayed.
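    A rough sketch of the target side of that scheme: the target retains
    each status PDU keyed by its StatRN until the initiator's ExpStatRN
    acknowledges it, and a NOP after reconnection both frees acknowledged
    status and identifies what must be re-sent.  The class and method
    names below are hypothetical, for illustration only, and do not
    reflect the iSCSI wire format.

```python
class TargetStatusStore:
    """Sketch: target-side retention of status PDUs until acknowledged."""

    def __init__(self):
        self.pending = {}  # StatRN -> status PDU not yet acknowledged

    def complete(self, stat_rn, status_pdu):
        # A command finished; hold its status until the initiator confirms.
        self.pending[stat_rn] = status_pdu

    def ack(self, exp_stat_rn):
        # The initiator's ExpStatRN acknowledges every StatRN below it.
        for rn in [rn for rn in self.pending if rn < exp_stat_rn]:
            del self.pending[rn]

    def on_nop(self, exp_stat_rn):
        # On a NOP after reconnection: free acknowledged status, then
        # return the still-undelivered status PDUs in StatRN order.
        self.ack(exp_stat_rn)
        return [self.pending[rn] for rn in sorted(self.pending)]
```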
    
    > From the ExpCmdRN, the initiator knows which commands it sent on
    > the failed
    > TCP connection were received by the target and which were not.  Any
    > commands received by the target, but not completed (no status pdu received
    > before the failure) should be resent on another TCP connection with the
    > "retry" bit set.  Any commands not received by the target are resent on
    > another TCP connection without the "retry" bit.
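    The classification described above can be sketched as follows; the
    helper and field names are hypothetical, chosen for illustration
    rather than taken from the draft.

```python
from dataclasses import dataclass


@dataclass
class Cmd:
    cmd_rn: int                   # the CmdRN this command was sent with
    status_received: bool = False # True once its status PDU arrived


def classify_for_resend(outstanding, exp_cmd_rn):
    """Decide how each command outstanding on a failed connection is
    re-issued, given the target's ExpCmdRN.  Returns (cmd, retry) pairs."""
    plan = []
    for cmd in outstanding:
        if cmd.status_received:
            continue  # completed at the initiator; nothing to recover
        # cmd_rn < ExpCmdRN: the target received it but never completed
        # it, so resend with the retry bit set (target may replay).
        # cmd_rn >= ExpCmdRN: the target never saw it; resend as new.
        plan.append((cmd, cmd.cmd_rn < exp_cmd_rn))
    return plan
```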
    
    An area to be examined is adapter or connection allegiance
    requirements.  As recovery must allow commands and status to traverse
    any connection, it does not make sense to restrict traffic to a single
    adapter.  Many global variables are used to create session-wide
    numbering.  If critical structures are limited to a single adapter,
    how would recovery proceed when that adapter itself fails?  Should the
    adapter determine failures?  If there is a layer above the adapters
    tracking progress, that layer should allow adapter independence.
    There must be a means to free resources on adjacent adapters once a
    task started on one adapter has completed on another during recovery.
    A good way to ensure separation of this supervisory function would be
    to insist on no adapter allegiance.
    
    > The target keeps the context (status and maybe data) of SCSI I/Os it's
    > executed until it has positive acknowledgment from the initiator
    > that the I/O
    > is complete at the initiators end.  This acknowledgment is
    > indicated in the
    > ExpStatRN received from the initiator.  Acknowledged I/Os are
    > then deallocated
    > in the target.
    >
    > Now for some issues I have with the (current) iSCSI draft:
    >
    > In section 2.2.2 it states "As the only cause for long delays in
    > responses can
    > be failed connections and received responses free-up resources,
    > we felt that
    > score boarding responses at the initiator could be accomplished by simple
    > bitmaps and there is no need to flow-control responses."
    >
    > Score boarding, especially with bit maps,  is an operation that can be
    > somewhat CPU heavy in the normal "performance path" of the iSCSI
    > layer. If the
    > ExpStatRN was local to each TCP connection, rather than global across the
    > iSCSI session, then there would be no requirement for score boarding.  The
    > initiator would simply increment the StatRN received on each
    > connection for
    > use in the ExpStatRN for that connection.
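    For comparison, the session-wide scoreboarding the draft alludes to
    can be sketched with a small sliding-window structure (a set stands in
    for the bitmap here; the names are assumptions, not the draft's):

```python
class StatScoreboard:
    """Sketch: session-wide scoreboard of received StatRNs."""

    def __init__(self):
        self.exp = 0       # lowest StatRN not yet received (ExpStatRN)
        self.seen = set()  # out-of-order StatRNs received above exp

    def receive(self, stat_rn):
        # Record the arrival, then slide the window forward past any
        # contiguous run of received StatRNs.
        self.seen.add(stat_rn)
        while self.exp in self.seen:
            self.seen.discard(self.exp)
            self.exp += 1
        return self.exp    # the value to report back as ExpStatRN
```

    With per-connection numbering, as the quoted text argues, all of this
    collapses to a single counter incremented on each status PDU.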
    
    Initiator Tags may be viewed as 32-bit random numbers.  If resources
    are handled autonomously at the adapter level, they must then be
    identified through an Initiator Tag conversion to a target-based
    structure.  Assuring idempotent processing becomes difficult, as
    nothing would relate the adapters' sequences and, since the adapters
    would be autonomous, even acknowledgement would be suspect.  Given the
    need for global oversight and logging, pushing sequencing and
    acknowledgement out to separate adapters, and thereby necessitating
    allegiance, is questionable from a reliability standpoint.  To ensure
    a sound design, a session should be processed globally, without
    adapter allegiance.
    
    > From an earlier email: "1.1.1.3   Data PDU numbering
    > Incoming Data PDUs MAY be numbered by a target to enable fast
    > recovery of long
    > running READ commands. Data PDUs are numbered with DataRN.  NOP
    > command PDUs
    > carrying the same Initiator Tag as the Data PDUs are used to
    > acknowledge the
    > incoming Data PDUs."
    >
    > Since the only "error" we are trying to recover from is the very
    > rare event
    > that a physical link fails, I fail to see what the benefit is to
    > be able to
    > "recover" at the PDU level.  Plus, you'll have to build into the
    > protocol a
    > mechanism to request retransmission of particular data PDUs.
    > Let's simplify
    > and just send the command with the "retry" bit set.
    
    As data delivery is not controlled within this protocol, data can be
    dropped even without a link failure.  I recommend restricting the size
    of a transfer so that extraordinary means of recovery are not
    required.
    
    > Also from an earlier email:
    >
    > > >Mallikarjun,
    > > >
    > > >Thanks for your comments.
    > > >
    > > >Initiator scoreboarding is not considered. I will try to emphasize this
    > > >even more in the new draft.
    > > >The party responsible for reporting length is the target.  As
    > overlapping
    > > >ranges are not explicitly
    > > >forbidden this would be a harder task than apparent. Reporting counts
    > > >becomes entirely a question of faith!
    > >
    > > I didn't realize that (what FC calls as) data overlay is allowed, FCP
    > > requires this initiator capability to be explicitly stated in session
    > > establishment (process login).  Is there a particular reason why this
    > > is chosen to be allowed by default in iSCSI?
    >
    > Again, in the interests of simplicity, I request that data overlay be
    > forbidden.  Period.  Otherwise, the initiator would have to perform score
    > boarding at the byte level to be positively sure that each byte was really
    > received.
    
    If you are encapsulating an FC drive, the level of virtualization you
    are insisting upon is beyond practical.  Even the present iSCSI
    protocol allows data to be dropped in such a way that there will be
    overlaps on requests as well as on delivery.
    
    With all this said, in an ideal world, consider redundant server
    design.  Servers from different manufacturers should be able to fail
    over access to the SAN, and any adapter should likewise be able to
    fail over.
    
    Doug
    
    

