
    RE: IETF mailing list question on Storage over Ethernet/IP



    At 06:21 PM 5/30/00 -0700, Wayland Jeong wrote:
    
    
    >This problem is not true of any network. Since FC supports both
    >link-level and end-to-end flow-control, the likelihood of packets being
    >dropped under congestion situations is mitigated.
    
    Packet drops are bad for many protocols.  However, just as bad is having
    fabric efficiency drop due to head-of-line blocking with long congestion
    timeouts within routing elements.  This limits the effective throughput of
    the fabric and results in higher-level application timeouts or additional
    resources within the endnodes to deal with application resource shortages.
    
    >Packets can still expire within congested switches but only after long 
    >time-outs occur (RA_TOV). Momentary bursts of congestion can be handled by 
    >applying backpressure all the way to the producers. Cisco's WRED (Weighted
    >Random Early Detection), as I understand it, is a mechanism to start
    >dropping packets before congestion becomes critical. The dropping of 
    >packets will trigger back-off by TCP.
    
    WRED provides a number of advantages.  It can be invoked when a threshold
    is crossed, so packets are not dropped the moment congestion appears.  When
    invoked, it can apply a filter mechanism based on packet attributes, e.g.
    class of service.  In general, WRED has been shown to improve the overall
    efficiency of the fabric while reducing oscillations for applications, thus
    delivering smoother operation overall.
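
    To make that threshold behavior concrete, here is a rough Python sketch of
    RED/WRED-style drop logic.  The class names, thresholds, and averaging
    weight are illustrative assumptions on my part, not taken from Cisco's or
    any other vendor's implementation.

        import random

        class WredProfile:
            def __init__(self, min_thresh, max_thresh, max_drop_prob):
                self.min_thresh = min_thresh        # avg depth where random drops begin
                self.max_thresh = max_thresh        # avg depth where every packet drops
                self.max_drop_prob = max_drop_prob  # drop probability at max_thresh

        class WredQueue:
            def __init__(self, profiles, weight=0.002):
                self.profiles = profiles            # class of service -> WredProfile
                self.avg_depth = 0.0                # EWMA of instantaneous queue depth
                self.weight = weight

            def admit(self, cos, current_depth):
                # Smooth the queue depth so drops respond to sustained congestion
                # rather than to a momentary burst.
                self.avg_depth = ((1 - self.weight) * self.avg_depth
                                  + self.weight * current_depth)
                p = self.profiles[cos]
                if self.avg_depth < p.min_thresh:
                    return True                     # below threshold: never drop
                if self.avg_depth >= p.max_thresh:
                    return False                    # past threshold: always drop
                # Between the thresholds the drop probability rises linearly, so a
                # better class of service (higher thresholds) starts dropping later.
                span = p.max_thresh - p.min_thresh
                drop_p = p.max_drop_prob * (self.avg_depth - p.min_thresh) / span
                return random.random() >= drop_p

    The "weighted" part is simply that each class of service gets its own
    thresholds and maximum drop probability.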
    
    >My point here is that in a streaming storage environment, packet loss is 
    >very bad. We want to prevent, at all costs, packet loss since losing a 
    >packet means triggering an I/O level retry of a very large chunk of data.
    
    This is true for nearly any application I can think of which uses a network.
    
    
    >I agree, bandwidth is always a solution, but are we going to see a 
    >ubiquitous deployment of 10Gbs Ethernet any time soon?
    
    I believe 10 GbE will be available before 10 Gb FC, and that will be at
    about the same time as this spec becomes sufficiently solid to begin
    building product.  So, the answer is yes.  Also, most workloads that
    operate with FC today can operate quite well with GbE, and one can always
    aggregate GbE links (802.3ad) to provide a fatter pipe while waiting for
    10 GbE.
    
    
    >Okay. I am not familiar enough with today's LAN products to comment
    >on their ability to provide near guaranteed QoS (i.e. fractional bandwidth).
    >So, QoS applied correctly or proper configuration of networks (i.e.
    >matching bandwidth requirements) could alleviate most of the problems.
    
    Yes.
    
    >I am still curious though how windowing works in the SIP model. As I
    >understand it, the proposal calls for a single command connection
    >(a target implements a well-known TCP server port). After authentication
    >of the client (host), a data channel is allocated for that connection.
    >Windowing applies to that channel which provides the target, the
    >ability to manage buffers with that host. Thus, if a data channel
    >communicates with server A, the target will advertise its window size
    >which would correspond to its available buffer space for that host.
    >But, typically, many hosts will login with one target. Thus, each
    >host will have its own data channel and hence its own advertised
    >window size. If, say 10 hosts connect to a given target and each
    >is allocated a 64KB window size, then the total buffer space
    >available at the target must be 640KB. Now, a host will have no
    >problem allocating this space, but the congestion point of interest
    >is not host memory, but in the target adapter itself (in fact, these
    >may be one and the same on a low-cost drive).
    
    1 MB of DRAM costs about $1.  I can buy very low-cost adapters today with
    1 MB of DRAM without much problem - a variety of GbE adapters ship with at
    least this much, and they are much cheaper than FC adapters.  There are a
    number of adapters out there that support 8, 16, or 32 MB of memory for a
    slight cost delta, and I've seen a few adapters that can support 256 MB of
    memory.
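
    To put the arithmetic in the question next to those memory sizes, a
    back-of-the-envelope sketch (the worst case assumes every host fills its
    advertised window at once):

        def target_buffer_needed(hosts, window_bytes):
            # Worst case: every connected host fills its advertised TCP window.
            return hosts * window_bytes

        window = 64 * 1024                          # 64 KB advertised per connection
        print(target_buffer_needed(10, window))     # 655,360 bytes, i.e. the 640 KB above
        print(target_buffer_needed(100, window))    # about 6.5 MB, within an 8-32 MB adapter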
    
    >Now, RTT is one way to coordinate access to the local buffers in
    >the target interface which may be acceptable. But, the equivalent
    >in FCP is XFER_READY and the intention of this protocol is to
    >both pace FCP_WRITES and also give the producer an indication
    >of what data is the best data to send. It really has no mapping
    >to physical buffers, only cache state.
    
    End-to-end flow control is implemented in both protocols though in two 
    slightly different manners.  It is possible to implement a similar credit 
    scheme on top of TCP with little difficulty.  I have some ideas on how this 
    could be implemented within this spec but am still bouncing them around 
    within HP to see if they are in alignment with the overall architecture.
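
    For what it is worth, the general shape of such a credit scheme layered on
    top of TCP might look like the sketch below.  The message types, framing,
    and function names are hypothetical illustrations of the idea - in the
    spirit of XFER_RDY - and not the specific mechanism being discussed within
    HP or anything from the spec.

        import socket
        import struct

        CREDIT_MSG = 0x01   # target -> initiator: "you may send this many more bytes"
        DATA_MSG   = 0x02   # initiator -> target: payload, never exceeding granted credit

        def grant_credit(conn: socket.socket, byte_count: int) -> None:
            # The target grants credit only for buffer space it has already
            # committed, so data always has a place to land on arrival.
            conn.sendall(struct.pack("!BI", CREDIT_MSG, byte_count))

        def send_within_credit(conn: socket.socket, payload: bytes, credit: int) -> int:
            # The initiator paces its writes to the outstanding credit; TCP's own
            # window still provides byte-level flow control underneath this
            # command-level pacing.
            chunk = payload[:credit]
            conn.sendall(struct.pack("!BI", DATA_MSG, len(chunk)) + chunk)
            return credit - len(chunk)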
    
    >It seems to me that protocol-level mechanisms for handling
    >flow control, like windowing and RTT, are better suited for gross-level
    >congestion management. In my opinion providing near zero
    >packet loss is not best handled at the protocol-level.
    
    I'd be interested in any data or modeling that suggests one mechanism over 
    the other if you have it available.
    
    >Yes, I think I understand how RDMA works. I was only making
    >a comment that the thrust of the IETF work is geared towards
    >creating an architecture which can yield acceptable performance.
    >I think the mapping is fairly straightforward. Making implementations
    >which achieve good performance and are cost effective is the real
    >challenge.
    >
    >Now, I saw a comment in the RDMA proposal which said that the
    >MSS size should be no more than 8KB to avoid fragmentation. How
    >does an MSS of 8KB avoid fragmentation on a 1.5KB MTU Ethernet
    >network? I'm sure I'm just missing something here.
    
    The RDMA proposal has some problems in its current form.  Again, RDMA is 
    about packet placement and not fragmentation avoidance and thus the 
    proposal needs to be fixed.
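
    For reference, the arithmetic behind the fragmentation question above,
    assuming standard 20-byte IP and TCP headers and ignoring the 8-byte
    fragment-offset alignment for simplicity:

        import math

        def fragments_per_segment(mss, mtu, ip_hdr=20, tcp_hdr=20):
            payload_per_packet = mtu - ip_hdr   # each IP fragment carries MTU - IP header
            segment_bytes = mss + tcp_hdr       # the TCP header counts toward the total
            return math.ceil(segment_bytes / payload_per_packet)

        print(fragments_per_segment(8 * 1024, 1500))   # 6 fragments on 1500-byte Ethernet
        print(fragments_per_segment(1460, 1500))       # 1 - the usual MSS for a 1500-byte MTU

    An 8 KB MSS avoids fragmentation only on links whose MTU is at least that
    large (e.g. jumbo frames), which is presumably what needs to be clarified
    in the proposal.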
    
    >I would be interested to see some information on this implementation.
    >Are there any public whitepapers or such on this product? I would
    >assume that it was on HP-UX.
    
    It is on HP-UX.  There is a whitepaper on the performance coming out quite
    soon - it should be out in June, if I recall correctly.
    
    >I'm not that familiar with GSN. Is that a HIPPI-based network?
    
    GSN (Gigabyte System Network) is another name for HIPPI-6400.
    
    >Yeah, I guess the bottom-line is that I'm not arguing about limitations in 
    >the architecture. I'm more interested in actual implementations. How will 
    >one implement this architecture and what kind of performance might one expect?
    
    I don't think the implementation is all that difficult.  Most people might
    leverage some of the SGL algorithms and driver work used with, say, a
    Tachyon-TL implementation and merge this into a good GbE implementation.
    HP has been looking into how this would be done and has not seen anything
    insurmountable yet.
    
    Mike
    

