Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
> From: Costa Sapuntzakis <csapuntz@cisco.com>
> ...
> Today, you have specialized silicon that for simple bus protocols
> (SCSI parallel interface and ATA) will directly take transfer blocks
> between the device and the buffer cache. This is not currently done
> with TCP, to the best of my knowledge. ...
It might be good to investigate the history of Protocol Engines Inc.,
including its goals, the reasons for its failure as a business, and what
it achieved technically. A skewed history might be:
1. founded to make silicon for XTP, a nominally faster protocol than TCP.
2. when XTP protocol and the XTP chips got bogged down, shifted to making
chips to help TCP go wire speed over FDDI.
3. other people made TCP go wire speed over FDDI without any special
silicon or new protocols. That took some wind out of XTP's sails,
and tore the sails driving PEI's TCP accelerator chips.
4. standard standards committee problems with XTP didn't help PEI's other
sails.
If you ask me, SCSI/IP and RDMA have striking parallels to #1 and #2.
I bet you'll meet parallels to #3 before any real deployment. You've
started to see #4 in some of the suggested improvements to RDMA today.
It's not that the suggestions are not good ideas. The problem is that
committees cannot say no to good ideas, while the one thing that matters
above all in any design task is saying no to almost everything.
Protocol Engines and XTP were based on the unexamined assumption that TCP
is very difficult to implement and an unavoidably slow protocol. Most
people just knew those "facts" 15 years ago. I think RDMA suffers a
similar problem. Instead of starting by assuming that a new protocol is
needed for a new goal, if you actually look within the existing boundaries,
you'll often find a solution. Often the inside solution is better than
any possible extension of the protocol. Protocol extensions require more
bandwidth and more processing on both sender and receiver. They also have
problems gaining enough marketshare to survive.
Please don't misunderstand me. Greg didn't include my name among
the authors on one of the XTP specs because I said XTP was a stupid
idea. I still like lots of XTP. I also think that many of the
XTP ideas can be *and have been* applied to TCP implementations.
> However, in the case of most storage protocols, you don't want
> the data in the receive buffer. You want it in the buffer cache, so
> there is a copy to the buffer cache.
Which NFS implementation written in the last 10 or at least 5 years and
intended to be fast doesn't move data between the buffer cache near the
disk and the buffer cache near the application with zero (0) copies?
Page flipping to and from buffer caches is especially easy, because
buffer caches tend to be page aligned, and file systems like to move
data in page-sized or larger chunks.
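The eligibility condition for page flipping described above can be sketched as a small predicate. This is a hypothetical illustration, not code from any real kernel; the function name and parameters are assumptions:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: a receive buffer can be "flipped" (remapped) into
 * the buffer cache instead of copied only if it starts on a page boundary
 * and covers whole pages.  Buffer caches tend to satisfy both, which is
 * why the flip is cheap for file-system traffic. */
bool can_page_flip(uintptr_t buf, size_t len, size_t page_size)
{
    return (buf % page_size) == 0      /* page-aligned start */
        && len >= page_size            /* at least one full page */
        && (len % page_size) == 0;     /* whole pages only */
}
```

A buffer that fails the test falls back to an ordinary copy, which is why file systems that move data in page-sized or larger chunks make the fast path the common one.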
> So, NFS has a CPU overhead hit as compared to optimized storage host bus
> adapters. The goal was to eliminate part of this hit, by getting rid of an
> extra copy.
How can you have fewer than zero copies?
> Now, this proposal doesn't fix the interrupt overhead problem.
> Optimized FC/SCSI NICs have one interrupt/transfer or less.
Interrupts are killers, and so for the last 5 or 10 years, a competitive
NFS system has had about 0.1 interrupts per packet. The trick is not
reducing the ratio of interrupts/packet, but reducing it only so far that
things don't slow down, and increasing the ratio when the total system
(client & server) moves into a regime that requires more interrupts.
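That adaptive policy might be sketched roughly as follows. The thresholds, names, and ring-buffer model here are all illustrative assumptions, not taken from any real driver:

```c
/* Hypothetical sketch of adaptive interrupt moderation: coalesce more
 * packets per interrupt when the receive ring is nearly idle, and
 * interrupt more often when the ring fills up and latency starts to
 * hurt the total (client & server) system. */
unsigned adjust_coalescing(unsigned pkts_per_intr,
                           unsigned ring_occupancy,
                           unsigned ring_size)
{
    /* Ring more than 3/4 full: we are falling behind, halve coalescing. */
    if (ring_occupancy * 4 > ring_size * 3 && pkts_per_intr > 1)
        return pkts_per_intr / 2;

    /* Ring less than 1/4 full: cheap to take fewer interrupts, double it
     * (capped at an arbitrary illustrative maximum of 32). */
    if (ring_occupancy * 4 < ring_size && pkts_per_intr < 32)
        return pkts_per_intr * 2;

    return pkts_per_intr;   /* in the comfortable middle: leave it alone */
}
```

The point of the sketch is the direction of adjustment, not the constants: the ratio is tuned down only until things stop slowing, and tuned back up when load demands more interrupts.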
] From: Michael Krause <krause@cup.hp.com>
] It ain't free and there are plenty of reasons to avoid copying data since
] ...
] touching the buffers themselves. Also, one could use this technology with
] storage devices to bypass the server and send data to one or more NICs for
] remote access - RDMA is still quite good for this type of operation and
] does not involve touching the data.
There are other, much easier ways to separate data and control
information in the receiver than being forced to parse optional
new bits in TCP or IP headers.
For 10 years, network interfaces in commercial UNIX systems have been
putting the headers (including RPC/XDR) of incoming NFS traffic in one
place (a "small mbuf") and the data in another place (the buffer cache)
without extra copies, and without parsing any headers, not to mention
new header bits with the nasty problems of TCP or IP options.
And this despite the fact that the RPC/XDR stuff is of variable and
hard-to-predict length (recall the NFS group list).
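The header/data split described above amounts to copying only the small, variable-length RPC/XDR header into a "small mbuf" and leaving the bulk data where the interface put it. A hypothetical sketch, with all names assumed for illustration:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: copy only the RPC/XDR header bytes into a small
 * header buffer and point at the payload in place, so the data itself
 * is never copied.  hdr_len is however long the header turned out to
 * be for this request -- it varies (e.g. with the NFS group list).
 * Returns the payload length, or 0 if the packet is malformed. */
size_t split_headers(const unsigned char *pkt, size_t pkt_len,
                     size_t hdr_len, unsigned char *hdr_buf,
                     const unsigned char **data)
{
    if (hdr_len > pkt_len)
        return 0;                    /* header claims more than we have */
    memcpy(hdr_buf, pkt, hdr_len);   /* the only copy: small headers */
    *data = pkt + hdr_len;           /* zero-copy: payload stays put */
    return pkt_len - hdr_len;
}
```

Nothing in the sketch parses TCP or IP option bits; the split falls out of knowing where the (variable-length) header ends, which the receiver must compute anyway.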
Vernon Schryver vjs@rhyolite.com
Last updated: Tue Sep 04 01:08:18 2001