
    Re: Connection Consensus Progress



    Sorry this is a little late; I haven't had a chance to send email in
    a couple of days.
    
    > (B) Should iSCSI have a session abstraction that
    > 	binds multiple TCP connections into one
    > 	iSCSI connection?
    
    You already know this, but I'd say no.
    
    > R1) Parallel transfers to/from and failover support for
    > 	tape devices.  In contrast to disks, multiple SCSI
    > 	connections to the same tape do not work (e.g.,
    > 	blocks can be written in the wrong order).
    
    I'd like to hear from a tape guru who believes that this is a)
    important and b) workable.  My limited experience with tape is that
    neither is the case.
    
    The tape drivers I have dug into use only a single SCSI command at a
    time and rely on read-ahead and write-behind buffering in the device
    to keep the performance up.  Assuming that this is the case, the
    performance portion of R1) is subsumed by R2) (parallelism for a
    single SCSI data transfer across multiple links), and the failover
    support is equivalent to R4).
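    To make the single-command pattern concrete, here is a minimal
    sketch (hypothetical, not taken from any particular driver) of the
    behavior described above: exactly one command in flight at a time,
    with the device's own buffering carrying the streaming performance.
    The helper names are illustrative only.

        # Hypothetical tape write loop: one SCSI command outstanding at
        # a time; the drive's write-behind buffer, not command-level
        # parallelism, keeps the media streaming.
        def write_stream(dev, blocks):
            for blk in blocks:
                cmd = dev.build_write(blk)        # one WRITE command
                status = dev.issue_and_wait(cmd)  # wait before issuing the next
                if status != 0:
                    # blocks must reach the media in order, so stop here
                    raise IOError("WRITE failed; cannot continue out of order")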
    
    Plus:
    
    > R1) and R2) are beyond the capabilities of existing SCSI-
    > based systems (note that a parallel bus is a single link). 
    
    iSCSI is hard enough as it is; I don't see the point of making it
    harder just to provide a capability whose wide applicability has not
    yet been proven.
    
    > R2) Obtaining parallelism for a single SCSI command
    > 	across multiple transport connections using
    > 	different physical links.
    
    As I have mentioned before, I believe that physical link speeds will
    increase at a more than adequate rate, so even if this feature is
    designed in, it will not be widely used.  We have already seen a huge
    acceleration in the rate at which faster links are coming, and iSCSI
    (+ hardware TCP or equivalent) will only increase that rate.
    
    I also think multiple adapters/connections per session will be
    incapable of delivering better performance in common circumstances.
    One reason is that in order to get good throughput on a link, you need
    to ensure that the operation is large enough to a) mask fixed
    processing latencies and b) provide sufficient outstanding credit on
    each link to mask the latency of returning additional credit.
    
    If you are using N links, your minimum optimal SCSI operation may be
    up to N times as large.  The N times as large case will only occur if
    there is ONLY a single SCSI op outstanding at a time (the tape case),
    because none of the network latencies will be masked by previous and
    subsequent operations.  If, in the typical case, there are multiple
    outstanding operations, the minimum optimal SCSI operation will not be
    N times as large, but it will still need to be larger than the single
    link case because of whatever critical path overhead comes from
    processing N times as many credit flows.
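    Put another way, this is just the bandwidth-delay product: to keep a
    link busy you need roughly rate * round-trip-time bytes in flight on
    it, and striping one command over N links means covering N such
    products at once.  A back-of-the-envelope sketch, with assumed and
    purely illustrative numbers:

        # Illustrative arithmetic only; the rate and RTT are assumptions.
        link_rate = 125e6          # bytes/sec, roughly 1 Gb/s
        rtt       = 400e-6         # seconds, assumed round-trip time
        n_links   = 4

        bdp = link_rate * rtt                  # bytes in flight per link
        min_op_one_link = bdp                  # single outstanding command
        min_op_striped  = n_links * bdp        # same command over N links

        print(f"per-link BDP:     {bdp/1024:.0f} KB")
        print(f"min op, one link: {min_op_one_link/1024:.0f} KB")
        print(f"min op, {n_links} links:  {min_op_striped/1024:.0f} KB")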
    
    My experience with current FC targets and various OS initiators is
    that the size of single SCSI operations from a typical file system is
    already on the small side for a single short gigabit link.  The typical
    operation size is usually somewhat immutable for a particular OS.
    It's usually wedded to fundamental memory management design decisions.
    We've been on the wrong side of the `if only the OS would give me
    bigger operations, we could really kick ass' enough times that it
    seems like a fool's game to hope for that.  OS initiators ARE capable
    of generating lots of concurrent transfer demand, but it's usually
    with more outstanding commands rather than fewer, larger ones.  See
    R5) below.
    
    iSCSI is intended to work on networks with larger latencies (i.e.,
    physically bigger networks) than the current batch of storage
    technologies, so the link latency effects will become even more
    pronounced than is commonly
    expected now.  We have seen substantial overall performance
    degradation on FC running @ 40 km [contrary to the Pittsburgh meeting
    minutes, Finisar makes FC transceivers that go 40+km, and maybe other
    companies do too], even with a large pool of link credits, because of
    inadequate transfer demand to mask the link latency.
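    For a feel of the 40 km case, a rough calculation (assumed values,
    for illustration only): light in fiber covers roughly 5 us/km, so
    40 km is about a 400 us round trip, and a gigabit-class link then
    needs on the order of 40 KB in flight just to stay busy.

        # Rough arithmetic for a long-haul FC link; all values assumed.
        km             = 40
        prop_us_per_km = 5.0                            # ~5 us/km in fiber
        rtt            = 2 * km * prop_us_per_km * 1e-6 # seconds, round trip

        fc_rate = 100e6                     # ~1 Gb/s payload rate, bytes/sec
        demand  = 16 * 1024                 # a typical single I/O, bytes

        bdp = fc_rate * rtt                 # bytes needed in flight
        utilization = min(1.0, demand / bdp)

        print(f"RTT {rtt*1e6:.0f} us, BDP {bdp/1024:.0f} KB, "
              f"utilization with one 16 KB op: {utilization:.0%}")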
    
    Finally, the `iSCSI is hard enough without tackling additional
    capabilities of unproven merit' argument applies to this too.
    
    > R3) Obtaining parallelism for a single SCSI command
    > 	across multiple transport connections using the
    > 	same physical links.
    > R3) needs more explanation, as TCP is known to be able
    > to saturate Gigabit Ethernet, given enough data to
    > transfer.  Is the argument for R3) that for the
    > transfer sizes likely to be seen in iSCSI, TCP
    > spends enough of its time in slow start and the
    > like that multiple TCP connections gain performance?
    
    My hunch is that doing this is horribly poor network citizenship.  If
    there is a way to get more performance out of a single end to end
    connection, it's the transport's (TCP's) responsibility to get it.
    Running multiple connections to end-run TCP's congestion avoidance
    algorithms has the potential to either slow everybody down or make the
    network unstable (which will certainly slow everybody down too).
    
    For that reason, I would suggest that iSCSI should categorically
    prohibit this behavior.  If you want to live by the sword (operate
    well on a general network), you have to die by the sword (put up with
    the inefficiencies required to keep the network healthy).
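    The arithmetic behind that concern, under the rough assumption that
    TCP congestion avoidance gives each connection an approximately
    equal share of a congested bottleneck: a host that opens N
    connections simply claims N shares at everyone else's expense.

        # Idealized per-connection fair-share model; purely illustrative.
        def share(my_conns, other_conns):
            return my_conns / (my_conns + other_conns)

        others = 10   # other flows sharing the bottleneck (assumed)
        for n in (1, 2, 4, 8):
            print(f"{n} connection(s): ~{share(n, others):.0%} of the bottleneck")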
    
    > R4) Optimize failure handling, so that a single TCP
    > 	connection loss doesn't immediately translate
    > 	into a SCSI error visible to higher level
    > 	(time-consuming) recovery logic.
    
    This seems like a straw-man for several reasons.
    
    First, this requirement suggests that the SCSI layer is not well
    adapted to handle errors.  A major part of any SCSI layer is all about
    error handling.  However, SCSI layers usually assume that the low
    level driver will make allowances for handling media-specific
    conditions.
    
    The big problem with non-fatal FC conditions causing fatal SCSI errors
    was inadequate FC layer engineering.  Early FC drivers badly abused
    the hospitality of the upper SCSI layers.
    
    For example, an event like a LIP (or any other link level event)
    typically had some finite duration and was directly detectable by the
    driver, so stupid drivers would detect the link failure and
    immediately return the SCSI operation with a retriable error code.
    The retry operation would come back to the FC driver which would then
    observe that the link was still down and fail the operation retriably
    again.  This would burn through the retry count instantly and result
    in a hard error.  More subtle was when a LIP caused other nodes to LIP
    themselves, at some substantial interval later, often to work around
    implementation bugs (can you say Tachyon?).  This would lead to many
    link up/down transitions in a short period of time.
    
    This is not a hard problem to solve, but many early driver writers did
    not contemplate how horrible it was going to be out there on the loop.
    One very large company even went so far as to say that FC-AL could
    never be implemented reliably and the only solution was to make sure
    all their FC was fabric just because they got surprised by the LIP
    storms.
    
    A connection drop in iSCSI is essentially a `media' event, and an
    iSCSI driver should not immediately fail subsequent operations to the
    addressed target without attempting to reestablish the connection
    first.  We make this same assumption in SST.  In fact, SST goes so far
    as to specify that blowing away a connection by either end is a
    perfectly acceptable and expected error recovery strategy in the case
    of some infrequent non-nominal conditions.
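    As a sketch of what `treat it as a media event' looks like in an
    error path (hypothetical names and policy values, not from any real
    driver): on a connection drop, hold and replay the affected commands
    across a reconnect attempt instead of bouncing retriable errors off
    a link that is known to still be down.

        import time

        RECONNECT_TIMEOUT  = 30.0   # seconds to keep trying (assumed policy)
        RECONNECT_INTERVAL = 1.0

        def handle_connection_drop(session, pending_cmds):
            deadline = time.monotonic() + RECONNECT_TIMEOUT
            while time.monotonic() < deadline:
                if session.try_reconnect():        # hypothetical helper
                    for cmd in pending_cmds:
                        session.reissue(cmd)       # replay on the new connection
                    return
                time.sleep(RECONNECT_INTERVAL)     # don't burn SCSI retries
            # only now surface the failure to the SCSI layer's recovery logic
            for cmd in pending_cmds:
                cmd.complete_with_error("target unreachable")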
    
    Second, I do not believe multiple connections will work effectively
    to handle errors that cannot be handled with appropriate connection
    failure recovery strategies.  There are actually two cases.  The first
    is a single interface with multiple connections (which I already
    suggested should be outlawed in response to R3).  In this case, when
    one connection fails, so will the other.  The second is multiple
    interfaces, each with a single connection.  In this case, the broken
    connection must be discovered before any form of recovery can occur
    for the transfers on it.  Having multiple open connections does not
    reduce the length of the critical path for recovery, so supporting
    multiple connections per iSCSI session cannot satisfy this
    requirement.
    
    > R5)     Obtaining parallelism between multiple SCSI commands
    >         across multiple transport connections using
    >         different physical links.
    
    I do not see that this offers anything that cannot be achieved with
    multiple iSCSI sessions using different physical links.
    
    The only thing it potentially offers is link aggregation in the case
    where all commands are sent to the target using ordered queue
    instead of simple queue.  I've never seen this happen.  Has anybody
    else?  Disk drivers use simple queue when they don't care, and some
    form of synchronous behavior (unqueued, or just sending one command at
    a time) when they care about order.  If the commands are simple queue,
    it doesn't matter whether they're sent in a single session or multiple
    sessions.
    
    The tape case is discussed under R1).
    
    > Those against should check that none of R1-R4 are important enough
    > to be requirements. 
    
    I have also argued that, in some cases, multiple connections per
    iSCSI session would not be capable of effectively satisfying the
    requirements.
    
    Don't get me wrong, I'm not arguing that link aggregation is a bad
    thing.  It would be great if somehow (magically) it just worked.  It
    would be a nice selling point for iSCSI whether or not it is actually
    widely used.  I AM arguing that any straightforward proposal is
    unlikely to deliver on the promise for reasons which are beyond the
    control of the iSCSI standard.  And, more complexity in the standard
    will slow down its deployment.
    
    Steph
    

