Re: NFS Header/data parsing and RDMA

To: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Subject: Re: NFS Header/data parsing and RDMA
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Date: Mon, 28 Feb 2000 10:43:01 -0700 (MST)
Delivery-Date: Mon Feb 28 12:43:45 2000
Sender: owner-ips@ece.cmu.edu

> From: Costa Sapuntzakis <csapuntz@cisco.com>

> Ok, so doing NFSv2/v3 header/data splitting is easy on an in-order
> TCP stream because NFS has fixed-length trailers. Here's a little
> technique:
> ...

> Note, to do this with NFS/TCP, your NIC has to do some primitive
> level of TCP processing (at least keep track of flows). It also
> needs to understand RPC/TCP message boundaries.
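
For reference, the RPC/TCP message boundaries mentioned above come from
RPC record marking: each fragment is preceded by a 4-byte big-endian word
whose top bit flags the last fragment of a record and whose low 31 bits
give the fragment length.  The following is only a minimal sketch of
finding those boundaries on an in-order byte stream.  The function name
`split_rpc_records` is hypothetical, and nothing here is taken from any
NIC firmware or from the technique elided in the quoted message.

```python
import struct

def split_rpc_records(stream: bytes):
    """Split an in-order RPC/TCP byte stream into complete RPC messages.

    Returns (messages, leftover), where leftover holds the bytes of any
    record that is not yet complete.
    """
    messages = []
    current = bytearray()
    offset = 0
    record_start = 0  # offset where the record being assembled began
    while offset + 4 <= len(stream):
        (word,) = struct.unpack_from(">I", stream, offset)
        last = bool(word & 0x80000000)   # top bit: last fragment of record
        length = word & 0x7FFFFFFF       # low 31 bits: fragment length
        if offset + 4 + length > len(stream):
            break  # fragment incomplete; wait for more data
        current += stream[offset + 4 : offset + 4 + length]
        offset += 4 + length
        if last:
            messages.append(bytes(current))
            current = bytearray()
            record_start = offset
    return messages, stream[record_start:]
```

The sketch only works because the stream is in order: a segment that
arrives out of order cannot be parsed until the marker before it has
been seen.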

Do I understand correctly that you're applying the familiar
NFS/UDP page flipping tactic to NFS/TCP?

> Are there significantly simpler approaches than this? 

1. How about using NFS/UDP instead of NFS/TCP?
  It's well known in the NFS community that NFSv2-3/TCP is no faster or
  otherwise better than NFSv2-3/UDP except over very narrow or at least
  rather long pipes.  (Recall also the congestion control and avoidance
  mechanisms in some NFSv2-3/UDP implementations.)

2. Use NFS/TCP, but send every RPC/XDR transaction in a single TCP segment,
  and use IP fragmentation to fit the MTU.  This tactic was used 10+
  years ago in the FDDI adapters of some supercomputers.  It does have
  the problems of IP fragmentation, but those problems are rarely
  encountered where NFS is used.
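
  To give a feel for what option 2 costs in fragments, here is a rough
  sketch of the IPv4 arithmetic.  The 8400-byte figure in the usage
  lines is only an illustrative assumption (8192 bytes of data plus
  about 208 bytes of TCP, RPC, and NFS headers), the MTUs are the usual
  Ethernet and FDDI values, and the function name is hypothetical.

```python
def ipv4_fragment_sizes(ip_payload_len, mtu, ip_header_len=20):
    """Return the payload size of each IPv4 fragment for one datagram.

    Every fragment except the last must carry a multiple of 8 bytes,
    because the fragment offset field counts 8-byte units.
    """
    per_frag = (mtu - ip_header_len) // 8 * 8
    if per_frag <= 0:
        raise ValueError("MTU too small to carry any payload")
    sizes = []
    remaining = ip_payload_len
    while remaining > per_frag:
        sizes.append(per_frag)
        remaining -= per_frag
    sizes.append(remaining)  # last fragment carries whatever is left
    return sizes

# Illustrative only: one 8400-byte TCP segment as a single IP datagram.
print(ipv4_fragment_sizes(8400, 1500))  # Ethernet: [1480, 1480, 1480, 1480, 1480, 1000]
print(ipv4_fragment_sizes(8400, 4352))  # FDDI: [4328, 4072]
```

  Under these assumptions the receiver sees one TCP header per RPC,
  always in the first fragment, so the data lands at a fixed offset.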

> NFSv4 doesn't seem to have fixed length trailers and neither
> does CIFS in all cases. And it looks like it will be costly to parse 
> NFSv4 headers. 

I've not been paying attention to NFSv4.  A quick skim of the draft
suggests that it will not displace NFSv2/3 in the environments where NFS
is currently popular.  NFSv4 certainly has nothing to do with anything
like SCSI over IP.  I'm also far from convinced that NFSv4 has got some
of the extensions close enough to the underlying real filesystems to be
popular.  Even if I'm wrong, it will be years before NFSv4 is widely used.
While I think there are ways to page flip NFSv4 without special hardware,
I don't think they are worth talking about yet.  Even if I'm also wrong
about that, it is years early to be modifying TCP/IP to support NFSv4.
No one can see what NFSv4 will be like when it is popular enough to justify
modifying TCP today, if NFSv4 ever is popular.


> RDMA still has the following features:
>
> - Per-packet (Works with arbitrary out-of-order reception of TCP
> segments)
> - Fixed header that's generic across all protocols (NFSv4, v5, AFS,
> DFS, CIFS, etc..) 
> - No page flipping necessary on solicited transfers
> - Message boundary bit (which is admittedly orthogonal to RDMA) allows
> out-of-order processing on TCP receive buffer. Decreases parsing latency,
> esp. in the face of packet drops.
> ...

Knowing to which buffer an out-of-order TCP segment belongs is something
that I don't see how to do without something like RDMA.  However,
out-of-order TCP segments are both very rare and very bad for TCP
performance, regardless of whether RDMA is present.  Out-of-order
TCP segments must be even more rare in storage networks.

Talk about NFSv5 or even AFS/DFS does the opposite of make me think there
might be something good in RDMA.  And as I've said, it's years too early
to justify RDMA with NFSv4.

With existing techniques, if you don't want to page flip, you don't need
to.  If you are able to provide enough distinct application buffer streams
to the NIC for RDMA, then you could do the same for other techniques.

What's that about "parsing latency" and what does it have to do with 
lost segments?  Are you proposing to deliver TCP data to applications
out of order?  I trust not!

   ....

] From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>

] ...
] No, that situation doesn't require any hardware support.  However, a zero-copy 
] receive path is not the only element of RDMA - RDMA was designed (I suppose 
] from the discussion here) specifically to address header/payload issues for 
] storage protocols.  Clearly one can do zero-copy receive with changes to the 
] API and no hardware/firmware modifications.  But with no special hardware 
] support, flipping the payload into some page with alignment constraints will 
] require another copy.

What about the many systems that have been page flipping NFS in and out
of buffer caches for more than 10 years, with no changes to APIs or special
silicon?

] There is one exception to my last statement that I know of:  If you pre-adjust 
] the hardware receive buffers to make the payload align on a page boundary, you 
] can flip the page into the buffer cache for (hopefully) the common case.  
] However, this requires the ability to tune these header offsets and will only 
] work for one protocol at a time (mostly).

The page flipping systems I've worked on did not tune header offsets and
worked on more than one protocol.  (Given your email address, it might be
interesting to check the old IRIX source trees.  Besides the NFS kernel
code and the HIPPI, ATM, and FDDI drivers and firmware, check cmd/rcp and
cmd/rsh.)  UDP page flipping is trivial on protocols that have no trailers.
It requires trivial smarts in the NIC and much simpler buffer allocation
by the NIC than RDMA requires.  (I suspect RDMA needs pools of buffers
for every stream, while the classic tactic needs only two pools, "little"
and "pages"....well, for tiny improvements I've also done it with "little",
"medium" and "pages".)
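
As a rough illustration of the two-pool idea, and not a description of
the IRIX drivers or any real firmware, the sketch below puts small
datagrams and the leading bytes of large ones into a "little" buffer
and the rest into page-sized buffers that a host could then flip.  The
names `place_datagram` and `split_at`, and both sizes, are assumptions
made up for this example.

```python
PAGE_SIZE = 4096    # assumed host page size
LITTLE_SIZE = 256   # assumed size of a "little" buffer

def place_datagram(datagram: bytes, split_at: int):
    """Return (little, pages) for one received datagram.

    split_at is the assumed number of leading header bytes; everything
    after it goes into page-sized buffers so the data is page aligned.
    """
    if split_at > LITTLE_SIZE:
        raise ValueError("headers do not fit in a little buffer")
    if len(datagram) <= LITTLE_SIZE:
        return datagram, []          # small datagram: little pool only
    little = datagram[:split_at]
    payload = datagram[split_at:]
    pages = [payload[i:i + PAGE_SIZE]
             for i in range(0, len(payload), PAGE_SIZE)]
    return little, pages
```

Nothing in the sketch is per-stream: the only state is the two pools,
which is the contrast with RDMA drawn above.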

] Realistically, who is going to be running a storage system that requires so 
] much bandwidth that avoiding receive copies is necessary, and runs on generic 
] NICs with no firmware/ASIC modifications possible?  So I think using modified 
] hardware is completely reasonable in those circumstances.
] ...

Even more reasonable than special hardware are modified APIs and protocols
and other steps, including ensuring that out-of-order packets are very
rare, and that header offsets are few, fixed, known, and friendly.

How would you have out-of-order arrival on a storage network, other than
due to bit rot in the wires, and what storage network is going to have
significant bit rot?


Vernon Schryver    vjs@rhyolite.com
