From: "HAAGENS,RANDY (HP-Roseville,ex1)" <randy_haagens@hp.com>
To: "IPS (E-mail)" <ips@ece.cmu.edu>
Subject: Re: Multiple TCP connections
Date: Sat, 5 Aug 2000 23:12:27 -0600
This memo recaps some of the reasons why the iSCSI design committee chose multiple TCP connections and the session concept. It also discusses the question of whether TCP connections should be related directly to LUNs.
We chose to support multiple TCP connections in order to benefit from concurrency in the fabric (primarily) and also in end node implementations (hardware and software). This is related to the stated requirement for bandwidth aggregation. The notion is that no matter how fast an individual link (100 Mbps, 1 Gbps or 10 Gbps), it will always be desirable to build end nodes and fabrics that can use multiple links, in parallel, for aggregated bandwidth.
The existence of the 802.3ad link aggregation standard is evidence that the
Ethernet community values bandwidth aggregation. Unfortunately,
802.3ad-compliant networks will achieve parallel flows on link trunks only
for traffic from different "conversations" (see Pat Thaler's memo dated
8/03). Our understanding is that for today, at least, all level-2 and -3
switches will forward the frames from a single TCP connection over the same
link of a multilink trunk. This is because the hash key used to assign a
frame to a trunk is based on a combination of the MAC and IP source and
destination addresses, plus the TCP source and destination port numbers.
(The more sophisticated the switch, the more of these values may be used in
the hash key.) For a single TCP connection, all of these values remain
identical for all of the frames in that connection. Hence, all of the
frames of that connection will take the same route through the L3 or L2
switched Ethernet fabric.
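To make the hashing point concrete, here is a rough Python sketch of the kind of trunk-link selection just described. The hash function and the exact field mix are illustrative only; real switches use their own vendor-specific algorithms.

import zlib

def trunk_link(src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port, n_links):
    # All frames of a single TCP connection carry identical values for these
    # fields, so they always hash to the same trunk link.
    key = f"{src_mac}{dst_mac}{src_ip}{dst_ip}{src_port}{dst_port}".encode()
    return zlib.crc32(key) % n_links

# Same connection -> same link, every time:
print(trunk_link("00a0c901", "00a0c902", "10.0.0.1", "10.0.0.2", 4000, 5003, 4))
# A second connection differing only in source port may land on another link:
print(trunk_link("00a0c901", "00a0c902", "10.0.0.1", "10.0.0.2", 4001, 5003, 4))

This is why varying the TCP port numbers (and, better yet, the IP and MAC addresses) is what lets L4, L3 and L2 switches spread the traffic across trunk links, as discussed next.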
Pat alludes to the possibility of discriminating at the session layer, where session-layer connection IDs would in fact be different. This doesn't solve the problem, however, because it would result in all the frames of a session taking the same link in a trunk. That's not what we want.
Our understanding is that to leverage existing infrastructure, and achieve parallel flows through the Ethernet fabric, we must use different TCP connections (and therefore different port numbers), at the very least. This practice will allow L4 switches to assign different TCP conversations to different 802.3ad links. While we're at it, it's helpful also to use different IP and MAC addresses, so that L3 and L2 switches also will do the right thing.
For a moment, assume that the IP/Ethernet fabric were able to support multi-link concurrency for a single TCP stream. Then, the question of in-order arrival occurs. Unquestionably, in-order arrival would be preferable, as it would ease the TCP segment re-assembly process. Arguably, however, out-of-order arrival could be reasonably handled by a TCP hardware engine, provided that the time skew of the arrivals was tightly controlled. (This limits the amount of memory required for the reassembly buffer.) On the other hand, early hardware implementations of TCP will likely assume in-order arrival for the fast-path implementation, and escalate to firmware for handling out-of-order segment arrival, which should normally happen only in the error case (dropped segment or IP route change). Allowing the routine arrival of segments out of order is probably not a wise choice.
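As a back-of-envelope illustration of why bounding the arrival skew bounds the reassembly memory: the buffer needed is roughly the link rate times the skew. The rates and skews below are illustrative assumptions only.

def reassembly_buffer_bytes(link_rate_bps, skew_seconds):
    # Data arriving ahead of a missing segment must be held for roughly
    # the arrival-skew interval before it can be delivered in order.
    return link_rate_bps / 8 * skew_seconds

for rate_gbps in (1, 10):
    for skew_ms in (1, 10):
        kib = reassembly_buffer_bytes(rate_gbps * 1e9, skew_ms * 1e-3) / 1024
        print(f"{rate_gbps} Gbps, {skew_ms} ms skew -> about {kib:,.0f} KiB")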
Alternatively, it's conceivable that switches could be designed that would
distribute TCP frames across multiple links, while maintaining the order of
their reassembly at the receiving switch ports. Note that the end nodes,
with their multiple links, would also have to participate in such a new
aggregation protocol. This new class of switches, if they were to emerge,
would make it feasible to consider limiting iSCSI sessions to a single
TCP/IP connection, at least for local area Ethernet fabrics. Similar
developments would be required in wide-area switching. Even assuming these
developments, one possible problem would remain: the TCP engine at the two
ends of the link would have to handle the aggregated traffic of several
links. Aggregating TCP connections into a session allows us to deploy
multiple TCP engines, typically one per IP address, and requires only that
the TCP engine implementation scales with the link speed.
The next question is, given multiple TCP connections per end node, how many
should we support? The iSCSI design committee concluded that the right
answer was "several". Consider the case of a multiport storage controller
(or host computer). To use each of the ports, we certainly need one TCP
connection per host per port at a minimum. If 100 host computers, each with 16 connections to the Ethernet fabric, share a storage array that also has 16 connections to the Ethernet fabric, then the storage array needs to support 1600 connections, which is reasonable. If the hosts actually use one connection group (aka "session") for writes, and a second one for reads, in order to allow reads that are unordered with respect to those writes, then 3200 connections are needed. Still reasonable for a large storage array.
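The arithmetic behind those figures is simply the fan-out of hosts across the array's ports, using the numbers above:

hosts = 100
array_ports = 16                 # Ethernet connections on the storage array

one_session = hosts * array_ports
print(one_session)               # 1600 connections, one session per host
print(one_session * 2)           # 3200 with separate read and write sessions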
Some have suggested a single connection per LU. This might be reasonable for a disk that contains only a single LU. But a storage controller today contains 1024 LUs, and in the future, perhaps 10,000 LUs. Sometimes an LU will be shared between multiple hosts, meaning that the number of connections per LU will be greater than one. Assume that 128 hosts are arranged in 16 clusters of 8 hosts, running a distributed file or database system between them. Then each LU will have to support 8 host connections. Assume further that a second connection per host is needed for asynchronous reads. That makes 16 connections per LU, or 160,000 connections in total. If each connection state record is 64 B (a totally wild guess), this amounts to 10 MB of memory needed for state records. As a point of comparison, first-generation TCP hardware accelerators are planned with support for approximately 1000 connections.
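For reference, the estimate works out as follows. The 64-byte state record is, as stated, a wild guess; the rest comes from the scenario just described.

lus = 10_000
hosts_per_lu = 8                  # 16 clusters of 8 hosts; each LU served to one cluster
connections_per_host_per_lu = 2   # one for writes, one for asynchronous reads
state_record_bytes = 64           # the "totally wild guess"

connections = lus * hosts_per_lu * connections_per_host_per_lu
print(connections)                                    # 160,000 connections
print(connections * state_record_bytes / 1e6, "MB")   # about 10 MB of state records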
If this weren't bad enough, it turns out that one (or two, in the case of asynchronous reads) connection per LU isn't enough to meet performance requirements. While the large number of TCP connections required for the many LUs certainly will deliver enough aggregate throughput for unrelated traffic, only one (or two) connections are available for a single LU. Bear in mind that for storage controllers, writes to an LU really are writes to storage controller memory, and not to disk. (A background process destages data to disk, typically at a much lower rate than data is delivered to cache, due to the benefits that the cache provides, which are too nuanced to go into here.) Today's storage controllers can absorb write bursts at typically 1 GB (that's gigabytes) per second, which would require the aggregation of eight 1 Gbps Ethernet links. By the time 10 GbE emerges, storage controller bandwidth will have scaled up to the 10 GBps range.
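The link-count figure is straight unit conversion: 1 GB/s of cache write bandwidth is 8 Gbps on the wire, before any protocol overhead.

cache_write_rate_gbytes_per_s = 1.0   # today's controller, per the figure above
link_rate_gbits_per_s = 1.0           # one Gigabit Ethernet link

links_needed = cache_write_rate_gbytes_per_s * 8 / link_rate_gbits_per_s
print(links_needed)                   # 8.0, ignoring TCP/IP and iSCSI overhead

# The same ratio holds when controllers reach 10 GB/s and links reach 10 Gbps:
print(10.0 * 8 / 10.0)                # still 8 links' worth of aggregation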
Conclusion: one (or two) TCP connections per LU is both too many (resulting
in too much memory devoted to state records) and too few (insufficient
bandwidth for high-speed IO to controller cache). Decoupling the number of
TCP connections from the number of LUs is the necessary result.
If you still don't buy this argument, consider the evolution to object-based storage, where SCSI LUs are replaced by objects. Objects may be used for the same purposes that LUs are today (to contain a file system, for example); or they may be used to contain file subtrees, individual files, or even file extents. They will be much more numerous than LUs.
iSCSI allows the host to bind n TCP connections together into an iSCSI session, which provides the SCSI "transport" function. The connections of this session typically will use n different Ethernet interfaces and their respective TCP engines. The session is connected to an abstract iSCSI "target", which is a collection of SCSI LUNs named by a URL. Within the session, thousands of IOs may be outstanding at a given time, involving perhaps 1600 or so LUs (128 hosts are organized into 16 clusters; the 10,000 LUs are divided among the 16 clusters of hosts).
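One way to picture that binding is a session object that owns its n connections (each on its own interface, with its own TCP engine) and spreads commands across them. The classes and the round-robin policy below are purely illustrative; the names are not taken from the draft.

from itertools import cycle

class Connection:
    def __init__(self, local_ip, remote_ip, port):
        # Each connection typically sits on its own Ethernet interface and
        # is therefore served by its own TCP engine.
        self.local_ip, self.remote_ip, self.port = local_ip, remote_ip, port

class Session:
    """n TCP connections bound together to form one SCSI transport."""
    def __init__(self, target_url, connections):
        self.target_url = target_url     # the abstract target: a collection of LUNs
        self.connections = connections
        self._next = cycle(connections)  # simplest possible policy: round-robin

    def pick_connection(self):
        return next(self._next)

session = Session("iscsi://array.example.com/target0",
                  [Connection(f"10.0.{i}.1", f"10.0.{i}.2", 5003) for i in range(4)])
print(session.pick_connection().local_ip)   # commands fan out across 4 interfaces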
Because the iSCSI session is a SCSI transport, we've chosen to support
ordered command delivery within the iSCSI session. SCSI requires this
functionality of any transport, so that the SCSI attributes "ordered" and
"simple" will have some meaning. This mechanism dates to the SCSI bus
(which is a link), which always delivers commands in order. Under the
assumption of in-order command delivery, the SCSI device server can
meaningfully use the task attributes to control the order of task execution. (Actually, the SCSI SAM-2 equivocates on whether ordered command delivery is a requirement; this is probably a compromise to permit FCP-1, which doesn't support ordered command delivery, to be a legal SCSI transport. Notably, FCP-2 has adopted a command numbering scheme similar to our own, for in-order command delivery.)
Command ordering is accomplished by numbering the commands. Command numbering has two additional benefits: (1) we can apply flow control to command delivery, in order to prevent the hosts from overrunning the storage array; (2) we can know, through a cumulative acknowledgement mechanism, that a command has been received at the storage controller. A similar mechanism is used for response message delivery, so that the target can know that its response (status) message was received at the initiator, and that command retry will not be subsequently attempted by the host. This permits the target to discard its command replay buffer.
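In sketch form, the initiator side of this numbering and windowing looks something like the following. The counter names are illustrative only and are not the draft's field names.

class CommandNumbering:
    """Initiator-side view of session-wide command numbering."""
    def __init__(self, window=32):
        self.next_cmd = 1            # number given to the next command issued
        self.acked = 0               # highest command cumulatively acknowledged
        self.max_cmd = window        # highest command number the target will accept

    def can_send(self):
        return self.next_cmd <= self.max_cmd   # window closed -> hold the command

    def number_command(self):
        n = self.next_cmd
        self.next_cmd += 1
        return n

    def on_response(self, acked_up_to, new_max_cmd):
        # Cumulative acknowledgement: every command numbered <= acked_up_to is
        # known to have reached the target, and the flow-control window advances.
        self.acked = acked_up_to
        self.max_cmd = new_max_cmd

num = CommandNumbering()
while num.can_send():
    num.number_command()             # commands 1..32 go out, then the window closes
num.on_response(acked_up_to=16, new_max_cmd=48)
print(num.can_send())                # True: the target has opened the window again

The target keeps the mirror-image state for its response numbering, which is what lets it free the command replay buffer once the initiator has acknowledged a status message.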
Sequencing of commands was chosen by the design committee after lengthy
consideration of an alternative: numbering every iSCSI session-layer PDU.
The latter approach actually would have made recovery after TCP connection
failure a lot easier, at least conceptually, since it would be handled at
the iSCSI PDU (message) level, and not at the higher SCSI task (command)
level. But there was a problem in the implementation: the central iSCSI
session layer would need to be involved in numbering every iSCSI PDU sent by any of the iSCSI/TCP engines. This would require an undesirable amount of communication between these engines. The method we've chosen requires only
that commands be numbered as they leave the SCSI layer, and similarly, that
response window variables be updated only when response messages are
returned to the SCSI layer. This assures that iSCSI code in the host will
run only when SCSI code runs, during startIO and completion processing.
R
Randy Haagens
Director, Networked Storage Architecture
Storage Organization
Hewlett-Packard Co.
tel. +1 916 785 4578
e-mail: Randy_Haagens@hp.com