    RE: IETF mailing list question on Storage over Ethernet/IP



    
     At 10:49 AM 5/26/00 -0700, Wayland Jeong wrote:
    > 
    > >I would argue, though, that congestion control in Fibre Channel networks
    > >is better (of course, FC is not completely immune from losing frames)
    > >than an Ethernet network. The link-level buffer-to-buffer credit
    > >mechanisms and the hardware end-to-end flow-control offered in Class-2
    > >services help ensure that packet loss is at a minimum. Traffic patterns
    > >in storage networks tend to be more congestion oriented than normal
    > >peer-to-peer LAN traffic patterns. Most SAN's have many initiators
    > >(hosts) and very few targets (storage devices, like RAID's). Thus,
    > >congestion can occur quite often due to the funneling of traffic to
    > >shared storage. Flow-control in FC helps take care of this situation.
    > 
    > This problem will be true of any application which shares a common point
    > of contention including FC.  As links evolve to 10 Gbps speeds, the
    > number of links into a given device may be reduced and thus increase
    > contention and fabric backpressure.  As such, the contention may result
    > in packet loss as fabric forward progress timers and other mechanisms
    > such as WRED kick in to alleviate the congestion via packet drops.
    > Note: I believe Cisco and others have numerous papers on how WRED can
    > smooth out traffic flow so that congestion is managed fairly well even
    > under heavy load.
    > 
    This is not true of every network. Since FC supports both
    link-level and end-to-end flow-control, the likelihood of packets being
    dropped under congestion is mitigated. Packets can still expire within
    congested switches, but only after long time-outs (R_A_TOV). Momentary
    bursts of congestion can be handled by applying backpressure all the way
    back to the producers. Cisco's WRED (Weighted Random Early Detection),
    as I understand it, is a mechanism that starts dropping packets before
    congestion becomes critical; the dropped packets then trigger back-off
    by TCP. My point here is that in a streaming storage environment packet
    loss is very bad. We want to prevent packet loss at all costs, since
    losing a packet means triggering an I/O-level retry of a very large
    chunk of data.
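
    To make the contrast concrete, here is a toy sketch (purely my own
    illustration, nothing from FC or Cisco documentation; class names and
    numbers are invented) of the difference between a credit-based link,
    where a sender simply stalls when the receiver is out of buffers, and a
    drop-based link, where an overfull queue discards frames and forces a
    retry further up the stack:

        # Toy model: credit-based (FC-style) vs. drop-based (WRED-style)
        # behavior when a receiver runs out of buffers. Illustrative only.

        class CreditedLink:
            """Sender may only transmit while it holds buffer-to-buffer credits."""
            def __init__(self, credits):
                self.credits = credits        # advertised receive buffers

            def send(self, frame):
                if self.credits == 0:
                    return "stall"            # backpressure: wait, nothing is lost
                self.credits -= 1
                return "delivered"

            def receiver_frees_buffer(self):
                self.credits += 1             # credit returned to the sender

        class DroppingLink:
            """Sender transmits blindly; a full queue discards frames."""
            def __init__(self, queue_depth):
                self.queue_depth = queue_depth
                self.queued = 0

            def send(self, frame):
                if self.queued >= self.queue_depth:
                    return "dropped"          # loss -> transport retry / I/O retry
                self.queued += 1
                return "queued"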
    
    I agree, bandwidth is always a solution, but are we going to see
    ubiquitous deployment of 10 Gbps Ethernet any time soon?
    
    
    > >I'm curious about how IPS will behave in these configurations. Is the
    > >windowing mechanism in TCP sufficient? Will there need to be XON/XOFF
    > >flow-control to help reduce packet loss?
    > 
    > XON/XOFF are link-level constructs and not transport-level.  As such,
    > they do not really apply.  TCP windows provide both sides with an
    > understanding of how much buffering is available for that connection.
    > To deal with bandwidth management issues, there are numerous techniques
    > such as transparent window adjustments that are possible to implement,
    > i.e. traffic shaping.  It is also possible to use the IP-based QoS to
    > adjust the network arbitration policies such that traffic can be
    > segregated.  This will allow the network to determine which traffic
    > gets serviced first and thus provides a degree of control for reducing
    > the application's view of congestion and the oscillations that may
    > result in reaction to that congestion.
    > 
    > Note: As with many congestion-related problems and given the continual
    > downward spiral of costs for many links, e.g. ethernet, many customers
    > will simply throw bandwidth at the problem within the fabric itself and
    > then adjust the arbitration and traffic shaping parameters (either
    > directly within their applications or transparently within the fabric).
    > 
    Okay. I am not familiar enough with today's LAN products to comment
    on their ability to provide near-guaranteed QoS (i.e. fractional
    bandwidth). So QoS applied correctly, or proper configuration of the
    network (i.e. matching bandwidth requirements), could alleviate most of
    the problems.
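
    As a purely illustrative aside on the IP-based QoS point above (my own
    sketch, not anything from the proposal): on a Linux-style socket API a
    storage application could mark its connection with a DSCP value so the
    network can segregate and prioritize that traffic. The code point,
    address and port below are arbitrary choices for the example.

        import socket

        AF41 = 34   # an example DSCP value; any agreed-upon code point would do

        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # IP_TOS carries the DSCP bits in the upper six bits of the TOS byte.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, AF41 << 2)
        sock.connect(("192.0.2.10", 5003))   # hypothetical target address/port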
    
    I am still curious, though, how windowing works in the SIP model. As I
    understand it, the proposal calls for a single command connection
    (a target implements a well-known TCP server port). After authentication
    of the client (host), a data channel is allocated for that connection.
    Windowing applies to that channel, which gives the target the ability
    to manage buffers for that host. Thus, if a data channel communicates
    with server A, the target will advertise its window size, which would
    correspond to its available buffer space for that host. But typically
    many hosts will log in to one target. Thus, each host will have its own
    data channel and hence its own advertised window size. If, say, 10 hosts
    connect to a given target and each is allocated a 64KB window, then the
    total buffer space available at the target must be 640KB. Now, a host
    will have no problem allocating this space, but the congestion point of
    interest is not host memory but the target adapter itself (in fact,
    these may be one and the same on a low-cost drive).
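
    A toy sketch of that buffer arithmetic (my own illustration; names and
    sizes are invented, and clamping SO_RCVBUF is only an approximation of
    bounding the advertised window):

        import socket

        TARGET_BUFFER_POOL = 640 * 1024    # total adapter buffering, illustrative
        PER_HOST_WINDOW    = 64 * 1024     # per-connection window, illustrative

        def max_full_windows(pool=TARGET_BUFFER_POOL, window=PER_HOST_WINDOW):
            """How many initiators can each be granted a full window."""
            return pool // window          # 640KB / 64KB -> 10 hosts

        def accept_initiator(listener, active_connections):
            """Accept a new host and clamp its receive buffer so the sum of
            advertised windows cannot exceed the adapter's buffer pool."""
            conn, addr = listener.accept()
            share = TARGET_BUFFER_POOL // (len(active_connections) + 1)
            conn.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF,
                            min(PER_HOST_WINDOW, share))
            active_connections.append(conn)
            return conn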
    
    Now, RTT is one way to coordinate access to the local buffers in the
    target interface, which may be acceptable. But the equivalent in FCP is
    XFER_READY, and the intention of that protocol is both to pace
    FCP_WRITEs and to give the producer an indication of which data is the
    best data to send. It really has no mapping to physical buffers, only
    cache state.
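
    For anyone not familiar with the mechanism I'm comparing against, here
    is roughly how I picture XFER_READY/RTT-style solicited writes (a toy
    sketch of my own, not the proposal's or FCP's wording): the target
    tells the initiator which slice of the I/O it is ready to accept, and
    the initiator sends only that slice, so the write is paced by the
    target rather than by the transport window.

        from dataclasses import dataclass

        @dataclass
        class TransferReady:
            """Target -> initiator: 'send me this slice of the write now.'"""
            offset: int
            length: int

        def target_solicit(total_len, burst_len):
            """Target paces a write by granting one burst at a time."""
            offset = 0
            while offset < total_len:
                grant = TransferReady(offset, min(burst_len, total_len - offset))
                yield grant                  # analogous to XFER_READY / RTT
                offset += grant.length

        def initiator_write(data, grants, send):
            """Initiator sends only the data each grant solicits."""
            for g in grants:
                send(data[g.offset:g.offset + g.length])

        # Example: a 256KB write paced in 64KB bursts.
        payload = bytes(256 * 1024)
        initiator_write(payload, target_solicit(len(payload), 64 * 1024),
                        send=lambda segment: None)   # stand-in for the wire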
    
    It seems to me that protocol-level mechanisms for handling flow
    control, like windowing and RTT, are better suited for gross-level
    congestion management. In my opinion, providing near-zero packet loss
    is not best handled at the protocol level.
    
    > >I certainly don't argue with both the merits and the technical
    > >feasibility of putting block storage over a TCP transport. It seems
    > >quite doable. The question is, how to get good performance out of real
    > >implementations. Certainly, proposals such as RDMA are trying to
    > >address that very concern (reduce system overhead due to datagram
    > >re-assembly).
    > 
    > RDMA only defines how to place the data on the remote for WRITEs or on
    > the requester for READs.  It does not define a SAR solution.  TCP
    > attempts to avoid SAR operations (IP fragmentation) by only
    > transmitting MSS sized packets.
    >
    Yes, I think I understand how RDMA works. I was only making
    a comment that the thrust of the IETF work is geared towards 
    creating an architecture which can yield acceptable performance.
    I think the mapping is fairly straightforward. Making implementations
    which achieve good performance and are cost effective is the real
    challenge.
    
    Now, I saw a comment in the RDMA proposal which said that the MSS
    should be no more than 8KB to avoid fragmentation. How does an 8KB MSS
    avoid fragmentation on a 1.5KB-MTU Ethernet network? I'm sure I'm just
    missing something here.
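
    For reference, the arithmetic I have in mind (my own note, not from the
    proposal): the MSS is normally derived from the link MTU by subtracting
    the IP and TCP headers, so an 8KB segment only fits in one frame on a
    jumbo-frame MTU.

        def mss_for_mtu(mtu, ip_header=20, tcp_header=20):
            """Largest TCP segment that fits in one link-layer frame."""
            return mtu - ip_header - tcp_header

        print(mss_for_mtu(1500))   # 1460 bytes on standard Ethernet
        print(mss_for_mtu(9000))   # 8960 bytes on a common jumbo-frame MTU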
    
    > 
    > >Even though TCP compensates for a lossy network, to achieve 100MB/s
    > >streaming throughput you don't want to have TCP, even in hardware,
    > >performing datagram retry. If you look at SLIC from Alacritech, they
    > >don't perform retry in hardware. Only the datapath is implemented in
    > >hardware. Retry, connection management and IP fragmentation are all
    > >still handled in software.
    > 
    > This is implementation specific and not a function of the architecture.
    > It is possible to perform these types of operations within hardware and
    > it has been done for quite some time, albeit simplistic in many cases,
    > for those solutions using tftp or discless devices (e.g. workstations).
    > The solution does not require software - that is an implementation
    > option.  Please note that efforts such as InfiniBand, which may be
    > argued as at least as complex as TCP/IP, do not mandate a hardware-only
    > implementation but many will implement one for most operations: SAR,
    > packet retransmissions, most error operations / recovery, etc.
    >
    I don't know anyone contemplating an InfiniBand implementation
    that is not in either hardware or firmware. I would argue, as well,
    that error recovery is not something that is typically handled by
    the hardware. The assumption is that the network is reliable.
    In other words, lost data is the exception rather than the norm.
    
    > 
    > Note: HP has been shipping TCP running over GbE running at link rate
    > (i.e. 940 Mbps of data payload) using standard MTU packets for quite
    > some time.  I have seen one netperf on a GSN implementation report over
    > 4 Gbps of data payload.  In the GbE case, the CPU was not fully
    > consumed; I do not have the specifics (MTU, CPU consumption) on the GSN
    > but given most of its data transmission and recovery is implemented in
    > hardware, clearly it is possible to achieve very high bandwidth using
    > standard as well as high-end link types while running the TCP/IP
    > protocol.
    >
    I would be interested to see some information on this implementation.
    Are there any public whitepapers or such on this product? I would
    assume that it was on HP-UX.
    
    I'm not that familiar with GSN. Is that a HIPPI-based network?
    
    > 
    > > > 2. Jumbo frames will not be necessary when TCP is implemented in
    > > > hardware. Most FC implementations use 1024 byte frames, and
    > > > performance is very adequate, given hardware implementation of FCP.
    > > >
    > >This is untrue. Most FC implementations use full 2KB frames and are
    > >capable of very large sequences (i.e. greater than the 64KB limitation
    > >imposed by the IPS proposal), unless there is something that I don't
    > >know about HP's Tachyon chips ;-)
    > 
    > I do not see what the issue is here.  In any solution, the application
    > posts a SGL to process and then the underlying hardware performs the
    > SAR operations and transmits / receives the data.  There should be no
    > architectural limit (in most cases the limit is very large and is
    > really a function of the hardware implementation) on how large a SGL
    > is, only how much of it can be transmitted in a single window at a
    > time and then determining whether the window is sufficiently large to
    > support the distance and application requirements.  An implementation
    > could take a large SGL transaction and then slice it into an
    > appropriate window for the environment it is operating in without
    > violating the transport capabilities.
    >
    Yeah, I guess the bottom line is that I'm not arguing about
    limitations in the architecture. I'm more interested in actual
    implementations. How will one implement this architecture, and what
    kind of performance might one expect?
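
    To make sure I follow the SGL-slicing point quoted above, here is
    roughly how I picture it (a toy sketch of my own; names and sizes are
    invented):

        def slices_for_window(sgl, window):
            """Yield (buffer_index, offset, length) pieces no larger than the
            currently available window, walking a scatter-gather list."""
            for idx, (addr, length) in enumerate(sgl):
                offset = 0
                while offset < length:
                    chunk = min(window, length - offset)
                    yield (idx, offset, chunk)
                    offset += chunk

        # Example: a three-element SGL pushed through a 64KB window.
        sgl = [(0x1000, 200 * 1024), (0x80000, 16 * 1024), (0xA0000, 96 * 1024)]
        for piece in slices_for_window(sgl, 64 * 1024):
            pass   # hand each piece to the (hypothetical) transmit engine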
    
    > 
    > Mike
    > 
    > 
    Thanks for the feedback.
    
    -Wayland
    

