RE: iSCSI: more on StatRN

To: "Stephen Bailey" <steph@cs.uchicago.edu>, <ips@ece.cmu.edu>
Subject: RE: iSCSI: more on StatRN
From: "Douglas Otis" <dotis@sanlight.net>
Date: Tue, 24 Oct 2000 17:42:48 -0700
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;charset="iso-8859-1"
Importance: Normal
In-Reply-To: <10010242302.AA22112@candide.cs.uchicago.edu>
Sender: owner-ips@ece.cmu.edu

Steph,

What probe rate would you specify for a connection that is waiting on a
response with no other confirmation that the connection is still up?
How do you feel about once every 10 seconds, with a three-strike failure
detection level ascertained at the point of sending?  A keep-alive is a
means to determine failure at the OS level within a reasonable amount of
time.  Done at the ULP, the traffic would be kept down at all times except
when a response is pending.  There should be a slower rate, say
every 60 seconds, for when the connection is idle.  How would you wish to
see this specified?  All this probing traffic also helps justify keeping the
number of connections low.
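
To make the proposed rates concrete, here is a minimal sketch of the probe
schedule.  Only the 10-second, 60-second, and three-strike figures come from
the text above.  The ProbeState class, its method names, and the reading of
"ascertained at the point of sending" (failure is declared when the next
probe is due and three earlier probes are still unanswered) are assumptions
made for illustration, not anything in the iSCSI draft.

```python
from dataclasses import dataclass

PENDING_INTERVAL_S = 10.0  # probe period while a response is pending (from the proposal)
IDLE_INTERVAL_S = 60.0     # slower probe period for an idle connection (from the proposal)
MAX_STRIKES = 3            # unanswered probes tolerated before declaring failure


@dataclass
class ProbeState:
    """Hypothetical per-connection keep-alive bookkeeping."""

    last_event: float = 0.0  # time of the last probe sent or last traffic received
    unanswered: int = 0      # probes sent since the peer was last heard from
    failed: bool = False

    def on_receive(self, now: float) -> None:
        # Any traffic from the peer confirms the connection is alive.
        self.last_event = now
        self.unanswered = 0

    def tick(self, now: float, response_pending: bool) -> str:
        """Return 'wait', 'probe', or 'fail' for the current instant."""
        if self.failed:
            return "fail"
        interval = PENDING_INTERVAL_S if response_pending else IDLE_INTERVAL_S
        if now - self.last_event < interval:
            return "wait"
        # Assumed reading: strikes are checked when the next probe is due.
        if self.unanswered >= MAX_STRIKES:
            self.failed = True
            return "fail"
        self.unanswered += 1
        self.last_event = now
        return "probe"
```

Under these assumptions, a peer that goes silent while a response is pending
is probed three times and declared failed when the fourth probe would be
due, about 40 seconds after the last traffic.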

Doug

> Julian,
>
> > The reason I suggested dropping connections after several format
> > errors was tolerance to software "glitches".
>
> 'tolerating' software glitches usually means detecting them where
> possible and making sure that you don't go off in the weeds as a
> result of them.  Unfortunately, most (? should we vote by distinct
> glitches, glitch occurrences, or maybe the amount of time (wall clock?
> programmer?) wasted by glitches %^) software glitches are not
> recoverable by mere retry.  They require explicit work-around.
> Therefore, I think the appropriate stance is to specify that the
> detector should hit the source of the glitch with the biggest possible
> hammer (connection reset) immediately.
>
> Obviously, work-arounds will happen, and as a result, they'll violate
> the SHALLs in the spec, but the fact is they're already addressing
> other violations of the SHALL.  No big deal.
>
> > The Check Condition is meant for cases in which SCSI can act - and
> > yes from the transport POV the command has finished.
>
> I guess the only point I'm trying to make is that I don't think SCSI
> status should be used for conditions which are not already defined in
> SAM/T10.  FCP and SST both define a `response' status mechanism which
> is used to report conditions which can be reported in-line, but are
> not SCSI generic.  For example, conflicting option flag settings in
> the CMD PDU (other than those in the CDB).  A key point (of which
> you're probably already aware) is that any error which CAN be
> reported in-line should be reported in-line, to improve overall
> responsiveness.
>
> If you're already on top of all that, and I'm preaching to the choir,
> right on.  If not, there it is.
>
> > Dropped PDUs will help us avoid DOS attacks with badly formed PDUs.
>
> What's the DOS attack that this addresses?  Certainly PDUs outside a
> connection will be dropped, but at the TCP layer before iSCSI ever
> sees it.  Once an iSCSI connection is established, I don't see how
> you're any more open or protected from a DOS attack.  Specifically,
> you initiate a TCP connection close on the first bogus PDU, and while
> you're closing you ignore everything that's not part of the close
> protocol, right?
>
> > And I will suggest activating the TCP keep alive option for early
> > detection of link failures.
>
> TCP keep alive has a chequered history, and may not be the right thing
> here.  Stevens said somewhere (TCPI I think), that it's more chic to
> have the ULP do keep alive if desired, which is where this whole
> discussion started.
>
> As long as you have no outstanding operations on a connection, neither
> end probably needs (or wants, if you believe Stevens' arguments) a
> keep alive.  Once you have operations in progress, the initiator is
> already keeping timers on every operation, so connection failure can
> initially be detected in that way.
>
> The reason why we specified a connection viability check on operation
> timeout in SST is to improve responsiveness during link failures.  You
> don't NEED to do the viability test at all, in which case, each
> operation will fail under its own timeout.  However, badly engineered
> FC implementations have shown that it's important to detect failure as
> early as possible wherever possible.  Otherwise the system can get
> extremely sluggish.
>
> And then there's the issue of the target recovering resources in a
> bounded amount of time.  In SST we specified that the target shall
> perform keep alives for this reason.  In iSCSI, I would suggest that
> it would be appropriate to specify that targets MAY perform an iSCSI
> keep alive when they have live commands on a connection if they care
> about recovering their resources.
>
> The key thing to remember about keep alives is that iSCSI endpoints
> may have extremely high connectivity degree, but are likely to have
> many inactive connections.  Having everybody banging away on each
> other with keep alives could have a substantial cost (or was everybody
> planning to hardware accelerate the keep alives :-?)
>
> Steph
>
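
As a rough illustration of the viability check on operation timeout
described in the quoted message, the sketch below probes the connection
only when some operation has already timed out.  The function name, the
operation IDs, and the connection_alive callback are hypothetical, and
failing every outstanding operation at once when the probe fails is an
assumed policy for showing the responsiveness gain, not something taken
from SST or the iSCSI draft.

```python
from typing import Callable, Dict, List


def expire_operations(
    deadlines: Dict[str, float],
    now: float,
    connection_alive: Callable[[], bool],
) -> List[str]:
    """Return the IDs of operations to fail at time `now`.

    deadlines maps a hypothetical operation ID to its timeout deadline.
    connection_alive is a hypothetical probe (for example, an iSCSI-level
    ping with a short timeout); it is invoked only after a timeout.
    """
    timed_out = sorted(op for op, deadline in deadlines.items() if deadline <= now)
    if not timed_out:
        # Nothing is overdue, so no probe traffic is generated.
        return []
    if connection_alive():
        # The connection is fine: only the overdue operations fail.
        return timed_out
    # The connection is dead: fail everything now rather than waiting
    # for each operation to run out its own timer.
    return sorted(deadlines)
```

Without the probe, the second operation in a call such as
expire_operations({"a": 5.0, "b": 50.0}, 10.0, probe) would sit until its
own deadline even though the link is already gone.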


