
    Re: Connection Consensus Progress



    Sorry this is a little late; I haven't had a chance to send email in
    a couple of days.
    
    > (B) Should iSCSI have a session abstraction that
    > 	binds multiple TCP connections into one
    > 	iSCSI connection?
    
    You already know this, but I'd say no.
    
    > R1) Parallel transfers to/from and failover support for
    > 	tape devices.  In contrast to disks, multiple SCSI
    > 	connections to the same tape do not work (e.g.,
    > 	blocks can be written in the wrong order).
    
    I'd like to hear from a tape guru who believes that this is a)
    important and b) workable.  My limited experience with tape is that
    neither is the case.
    
    The tape drivers I have dug into use only a single SCSI command at a
    time and rely on read-ahead and write-behind buffering in the device
    to keep the performance up.  Assuming that this is the case, the
    performance portion of R1) is subsumed by R2) (parallelism for a
    single SCSI data transfer across multiple links), and the failover
    support is equivalent to R4).
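    To make the single-command pattern concrete, here is a minimal
    sketch (hypothetical, not taken from any particular driver) of the
    behavior described above: exactly one command in flight at a time,
    with the device's own buffering carrying the streaming performance.
    The helper names are illustrative only.

        # Hypothetical tape write loop: one SCSI command outstanding at
        # a time; the drive's write-behind buffer, not command-level
        # parallelism, keeps the media streaming.
        def write_stream(dev, blocks):
            for blk in blocks:
                cmd = dev.build_write(blk)        # one WRITE command
                status = dev.issue_and_wait(cmd)  # wait before issuing the next
                if status != 0:
                    # blocks must reach the media in order, so stop here
                    raise IOError("WRITE failed; cannot continue out of order")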
    
    Plus:
    
    > R1) and R2) are beyond the capabilities of existing SCSI-
    > based systems (note that a parallel bus is a single link). 
    
    iSCSI is hard enough as it is; I don't see the point of making it
    harder just to provide a capability whose wide applicability has not
    yet been proven.
    
    > R2) Obtaining parallelism for a single SCSI command
    > 	across multiple transport connections using
    > 	different physical links.
    
    As I have mentioned before, I believe that physical link speeds will
    increase at a more than adequate rate, so even if this feature is
    designed in, it will not be widely used.  We have already seen a huge
    acceleration in the rate at which faster links are coming, and iSCSI
    (+ hardware TCP or equivalent) will only increase that rate.
    
    I also think multiple adapters/connections per session will be
    incapable of delivering better performance in common circumstances.
    One reason is that in order to get good throughput on a link, you need
    to ensure that the operation is large enough to a) mask fixed
    processing latencies and b) provide sufficient outstanding credit on
    each link to mask the latency of returning additional credit.
    
    If you are using N links, your minimum optimal SCSI operation may be
    up to N times as large.  The N times as large case will only occur if
    there is ONLY a single SCSI op outstanding at a time (the tape case),
    because none of the network latencies will be masked by previous and
    subsequent operations.  If, in the typical case, there are multiple
    outstanding operations, the minimum optimal SCSI operation will not be
    N times as large, but it will still need to be larger than the single
    link case because of whatever critical path overhead comes from
    processing N times as many credit flows.
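    Put another way, this is just the bandwidth-delay product: to keep a
    link busy you need roughly rate * round-trip-time bytes in flight on
    it, and striping one command over N links means covering N such
    products at once.  A back-of-the-envelope sketch, with assumed and
    purely illustrative numbers:

        # Illustrative arithmetic only; the rate and RTT are assumptions.
        link_rate = 125e6          # bytes/sec, roughly 1 Gb/s
        rtt       = 400e-6         # seconds, assumed round-trip time
        n_links   = 4

        bdp = link_rate * rtt                  # bytes in flight per link
        min_op_one_link = bdp                  # single outstanding command
        min_op_striped  = n_links * bdp        # same command over N links

        print(f"per-link BDP:     {bdp/1024:.0f} KB")
        print(f"min op, one link: {min_op_one_link/1024:.0f} KB")
        print(f"min op, {n_links} links:  {min_op_striped/1024:.0f} KB")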
    
    My experience with current FC targets and various OS initiators is
    that the size of single SCSI operations from a typical file system is
    already on the small side for a single short gigabit link.  The typical
    operation size is usually somewhat immutable for a particular OS.
    It's usually wedded to fundamental memory management design decisions.
    We've been on the wrong side of the `if only the OS would give me
    bigger operations, we could really kick ass' enough times that it
    seems like a fool's game to hope for that.  OS initiators ARE capable
    of generating lots of concurrent transfer demand, but it's usually
    with more outstanding commands rather than fewer, larger ones.  See
    R5) below.
    
    iSCSI is intended to work on networks with larger latencies (i.e.,
    physically bigger networks) than the current batch of storage
    technologies, so the link latency effects will become even more
    pronounced than is commonly
    expected now.  We have seen substantial overall performance
    degradation on FC running @ 40 km [contrary to the Pittsburgh meeting
    minutes, Finisar makes FC transceivers that go 40+km, and maybe other
    companies do too], even with a large pool of link credits, because of
    inadequate transfer demand to mask the link latency.
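    For a feel of the 40 km case, a rough calculation (assumed values,
    for illustration only): light in fiber covers roughly 5 us/km, so
    40 km is about a 400 us round trip, and a gigabit-class link then
    needs on the order of 40 KB in flight just to stay busy.

        # Rough arithmetic for a long-haul FC link; all values assumed.
        km             = 40
        prop_us_per_km = 5.0                            # ~5 us/km in fiber
        rtt            = 2 * km * prop_us_per_km * 1e-6 # seconds, round trip

        fc_rate = 100e6                     # ~1 Gb/s payload rate, bytes/sec
        demand  = 16 * 1024                 # a typical single I/O, bytes

        bdp = fc_rate * rtt                 # bytes needed in flight
        utilization = min(1.0, demand / bdp)

        print(f"RTT {rtt*1e6:.0f} us, BDP {bdp/1024:.0f} KB, "
              f"utilization with one 16 KB op: {utilization:.0%}")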
    
    Finally, the `iSCSI is hard enough without tackling additional
    capabilities of unproven merit' argument applies to this too.
    
    > R3) Obtaining parallelism for a single SCSI command
    > 	across multiple transport connections using the
    > 	same physical links.
    > R3) needs more explanation, as TCP is known to be able
    > to saturate Gigabit Ethernet, given enough data to
    > transfer.  Is the argument for R3) that for the
    > transfer sizes likely to be seen in iSCSI, TCP
    > spends enough of its time in slow start and the
    > like that multiple TCP connections gain performance?
    
    My hunch is that doing this is horribly poor network citizenship.  If
    there is a way to get more performance out of a single end to end
    connection, it's the transport's (TCP's) responsibility to get it.
    Running multiple connections to end-run TCP's congestion avoidance
    algorithms has the potential to either slow everybody down or make the
    network unstable (which will certainly slow everybody down too).
    
    For that reason, I would suggest that iSCSI should categorically
    prohibit this behavior.  If you want to live by the sword (operate
    well on a general network), you have to die by the sword (put up with
    the inefficiencies required to keep the network healthy).
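    The arithmetic behind that concern, under the rough assumption that
    TCP congestion avoidance gives each connection an approximately
    equal share of a congested bottleneck: a host that opens N
    connections simply claims N shares at everyone else's expense.

        # Idealized per-connection fair-share model; purely illustrative.
        def share(my_conns, other_conns):
            return my_conns / (my_conns + other_conns)

        others = 10   # other flows sharing the bottleneck (assumed)
        for n in (1, 2, 4, 8):
            print(f"{n} connection(s): ~{share(n, others):.0%} of the bottleneck")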
    
    > R4) Optimize failure handling, so that a single TCP
    > 	connection loss doesn't immediately translate
    > 	into a SCSI error visible to higher level
    > 	(time-consuming) recovery logic.
    
    This seems like a straw-man for several reasons.
    
    First, this requirement suggests that the SCSI layer is not well
    adapted to handle errors.  A major part of any SCSI layer is all about
    error handling.  However, SCSI layers usually assume that the low
    level driver will make allowances for handling media-specific
    conditions.
    
    The big problem with non-fatal FC conditions causing fatal SCSI errors
    was inadequate FC layer engineering.  Early FC drivers badly abused
    the hospitality of the upper SCSI layers.
    
    For example, an event like a LIP (or any other link level event)
    typically had some finite duration and was directly detectable by the
    driver, so stupid drivers would detect the link failure and
    immediately return the SCSI operation with a retriable error code.
    The retry operation would come back to the FC driver which would then
    observe that the link was still down and fail the operation retriably
    again.  This would burn through the retry count instantly and result
    in a hard error.  More subtle was when a LIP caused other nodes to LIP
    themselves, at some substantial interval later, often to work around
    implementation bugs (can you say Tachyon?).  This would lead to many
    link up/down transitions in a short period of time.
    
    This is not a hard problem to solve, but many early driver writers did
    not contemplate how horrible it was going to be out there on the loop.
    One very large company even went so far as to say that FC-AL could
    never be implemented reliably and the only solution was to make sure
    all their FC was fabric just because they got surprised by the LIP
    storms.
    
    A connection drop in iSCSI is essentially a `media' event, and an
    iSCSI driver should not immediately fail subsequent operations to the
    addressed target without attempting to reestablish the connection
    first.  We make this same assumption in SST.  In fact, SST goes so far
    as to specify that blowing away a connection by either end is a
    perfectly acceptable and expected error recovery strategy in the case
    of some infrequent non-nominal conditions.
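    As a sketch of what `treat it as a media event' looks like in an
    error path (hypothetical names and policy values, not from any real
    driver): on a connection drop, hold and replay the affected commands
    across a reconnect attempt instead of bouncing retriable errors off
    a link that is known to still be down.

        import time

        RECONNECT_TIMEOUT  = 30.0   # seconds to keep trying (assumed policy)
        RECONNECT_INTERVAL = 1.0

        def handle_connection_drop(session, pending_cmds):
            deadline = time.monotonic() + RECONNECT_TIMEOUT
            while time.monotonic() < deadline:
                if session.try_reconnect():        # hypothetical helper
                    for cmd in pending_cmds:
                        session.reissue(cmd)       # replay on the new connection
                    return
                time.sleep(RECONNECT_INTERVAL)     # don't burn SCSI retries
            # only now surface the failure to the SCSI layer's recovery logic
            for cmd in pending_cmds:
                cmd.complete_with_error("target unreachable")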
    
    Second, I do not believe multiple connections will work effectively
    to handle errors that cannot be handled with appropriate connection
    failure recovery strategies.  There are actually two cases.  The first
    is a single interface with multiple connections (which I already
    suggested should be outlawed in response to R3).  In this case, when
    one connection fails, so will the other.  The second is multiple
    interfaces, each with a single connection.  In this case, the broken
    connection must be discovered before any form of recovery can occur
    for the transfers on it.  Having multiple open connections does not
    reduce the length of the critical path for recovery, so supporting
    multiple connections per iSCSI session cannot satisfy this
    requirement.
    
    > R5)     Obtaining parallelism between multiple SCSI commands
    >         across multiple transport connections using
    >         different physical links.
    
    I do not see that this offers anything that cannot be achieved with
    multiple iSCSI sessions using different physical links.
    
    The only thing it potentially offers is link aggregation in the case
    where all commands are sent to the target using ordered queue
    instead of simple queue.  I've never seen this happen.  Has anybody
    else?  Disk drivers use simple queue when they don't care, and some
    form of synchronous behavior (unqueued, or just sending one command at
    a time) when they care about order.  If the commands are simple queue,
    it doesn't matter whether they're sent in a single session or multiple
    sessions.
    
    The tape case is discussed under R1).
    
    > Those against should check that none of R1-R4 are important enough
    > to be requirements. 
    
    I have also argued that, in some cases, multiple connections per
    iSCSI session would not be capable of effectively satisfying the
    requirements.
    
    Don't get me wrong, I'm not arguing that link aggregation is a bad
    thing.  It would be great if somehow (magically) it just worked.  It
    would be a nice selling point for iSCSI whether or not it is actually
    widely used.  I AM arguing that any straightforward proposal is
    unlikely to deliver on the promise for reasons which are beyond the
    control of the iSCSI standard.  And, more complexity in the standard
    will slow down its deployment.
    
    Steph
    

