
    Re: NFS Header/data parsing and RDMA



    > From: Costa Sapuntzakis <csapuntz@cisco.com>
    
    > Ok, so doing NFSv2/v3 header/data splitting is easy on an in-order
    > TCP stream because NFS has fixed-length trailers. Here's a little
    > technique:
    > ...
    
    > Note, to do this with NFS/TCP, your NIC has to do some primitive
    > level of TCP processing (at least keep track of flows). It also
    > needs to understand RPC/TCP message boundaries.
    
    Do I understand correctly that you're applying the familiar
    NFS/UDP page flipping tactic to NFS/TCP?
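    
    For reference, the RPC/TCP message boundaries mentioned above come from
    ONC RPC record marking: each fragment is preceded by a 4-byte big-endian
    word whose high bit flags the last fragment of a record and whose low
    31 bits give the fragment length.  Below is a minimal Python sketch of
    finding those boundaries on an in-order byte stream.  The names
    split_rpc_records and frame are made up for illustration; this is not
    code from any NIC or NFS implementation.

```python
import struct

LAST_FRAGMENT = 0x80000000  # high bit of the 4-byte record mark
LENGTH_MASK = 0x7FFFFFFF    # low 31 bits: fragment length in bytes


def split_rpc_records(stream: bytes):
    """Return (list of complete RPC messages, bytes consumed).

    Only whole records are returned; a trailing partial record is left
    unconsumed so the caller can retry once more data has arrived.
    """
    records = []
    parts = []      # fragments of the record being assembled
    pos = 0         # scan position in the stream
    consumed = 0    # offset just past the last complete record
    while pos + 4 <= len(stream):
        (mark,) = struct.unpack_from(">I", stream, pos)
        length = mark & LENGTH_MASK
        if pos + 4 + length > len(stream):
            break   # this fragment has not fully arrived yet
        parts.append(stream[pos + 4 : pos + 4 + length])
        pos += 4 + length
        if mark & LAST_FRAGMENT:
            records.append(b"".join(parts))
            parts = []
            consumed = pos
    return records, consumed


def frame(payload: bytes, last: bool = True) -> bytes:
    """Prefix one fragment with its record mark (helper for testing)."""
    mark = len(payload) | (LAST_FRAGMENT if last else 0)
    return struct.pack(">I", mark) + payload
```

    The point of the sketch is only that the boundaries are cheap to find
    when the stream is in order; it says nothing about what to do when a
    segment is missing.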
    
    > Are there significantly simpler approaches than this? 
    
    1. How about using NFS/UDP instead of NFS/TCP?
      It's well known in the NFS community that NFSv2-3/TCP is no faster or
      otherwise better than NFSv2-3/UDP except over very narrow or at least
      rather long pipes.  (Recall also the congestion control and avoidance
      mechanisms in some NFSv2-3/UDP implementations.)
    
    2. Use NFS/TCP, but send every RPC/XDR transaction in a single TCP segment,
      and use IP fragmentation to fit the MTU.  This tactic was used 10+
      years ago in the FDDI adapters of some supercomputers.  It does have
      the problems of IP fragmentation, but those problems are rarely
      encountered where NFS is used.
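    
    As a rough illustration of what tactic 2 costs in fragments, here is a
    back-of-the-envelope calculation in Python.  The assumptions are mine:
    an 8 KB read reply, a hypothetical 128 bytes of RPC/NFS header, 20-byte
    TCP and IP headers with no options, and the standard IP MTUs of 4352
    for FDDI and 1500 for Ethernet.

```python
import math


def ip_fragment_count(ip_payload_len: int, mtu: int, ip_header_len: int = 20) -> int:
    """Number of IPv4 fragments needed to carry one datagram payload."""
    # Every fragment except the last must carry a multiple of 8 bytes,
    # because the IPv4 fragment offset field counts 8-byte units.
    per_fragment = (mtu - ip_header_len) // 8 * 8
    return max(1, math.ceil(ip_payload_len / per_fragment))


NFS_DATA = 8192       # assumed read size
RPC_NFS_HEADER = 128  # hypothetical header size, for illustration only
TCP_HEADER = 20       # no TCP options assumed

segment = NFS_DATA + RPC_NFS_HEADER + TCP_HEADER  # one oversized TCP segment
fddi_fragments = ip_fragment_count(segment, 4352)
ethernet_fragments = ip_fragment_count(segment, 1500)
```

    With these assumptions the segment needs 2 fragments on FDDI and 6 on
    Ethernet.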
    
    > NFSv4 doesn't seem to have fixed length trailers and neither
    > does CIFS in all cases. And it looks like it will be costly to parse 
    > NFSv4 headers. 
    
    I've not been paying attention to NFSv4.  A quick skim of the draft
    suggests that it will not displace NFSv2/3 in the environments where NFS
    is currently popular.  NFSv4 certainly has nothing to do with anything
    like SCSI over IP.  I'm also far from convinced that NFSv4 has got some
    of the extensions close enough to the underlying real filesystems to be
    popular.  Even if I'm wrong, it will be years before NFSv4 is widely used.
    While I think there are ways to page flip NFSv4 without special hardware,
    I don't think they are worth talking about yet.  Even if I'm also wrong
    about that, it is years early to be modifying TCP/IP to support NFSv4.
    No one can see today what NFSv4 will be like when it is popular enough to
    justify modifying TCP, if NFSv4 ever is popular.
    
    
    > RDMA still has the following features:
    >
    > - Per-packet (Works with arbitrary out-of-order reception of TCP
    > segments)
    > - Fixed header that's generic across all protocols (NFSv4, v5, AFS,
    > DFS, CIFS, etc..) 
    > - No page flipping necessary on solicited transfers
    > - Message boundary bit (which is admittedly orthogonal to RDMA) allows
    > out-of-order processing on TCP receive buffer. Decreases parsing latency,
    > esp. in the face of packet drops.
    > ...
    
    Knowing to which buffer an out-of-order TCP segment belongs is something
    that I don't see how to do without something like RDMA.  However,
    out-of-order TCP segments are both very rare and very bad for TCP
    performance, regardless of whether RDMA is present.  Out-of-order
    TCP segments must be even more rare in storage networks.
    
    Talk about NFSv5 or even AFS/DFS does the opposite of making me think there
    might be something good in RDMA.  And as I've said, it's years too early
    to justify RDMA with NFSv4.
    
    With existing techniques, if you don't want to page flip, you don't need
    to.  If you are able to provide enough distinct application buffer streams
    to the NIC for RDMA, then you could do the same for other techniques.
    
    What's that about "parsing latency" and what does it have to do with 
    lost segments?  Are you proposing to deliver TCP data to applications
    out of order?  I trust not!
    
       ....
    
    ] From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>
    
    ] ...
    ] No, that situation doesn't require any hardware support.  However, a zero-copy 
    ] receive path is not the only element of RDMA - RDMA was designed (I suppose 
    ] from the discussion here) specifically to address header/payload issues for 
    ] storage protocols.  Clearly one can do zero-copy receive with changes to the 
    ] API and no hardware/firmware modifications.  But with no special hardware 
    ] support, flipping the payload into some page with alignment constraints will 
    ] require another copy.
    
    What about the many systems that have been page flipping NFS in and out
    of buffer caches for more than 10 years, with no changes to APIs or special
    silicon?
    
    ] There is one exception to my last statement that I know of:  If you pre-adjust 
    ] the hardware receive buffers to make the payload align on a page boundary, you 
    ] can flip the page into the buffer cache for (hopefully) the common case.  
    ] However, this requires the ability to tune these header offsets and will only 
    ] work for one protocol at a time (mostly).
    
    The page flipping systems I've worked on did not tune header offsets and
    worked on more than one protocol.  (Given your email address, it might be
    interesting to check the old IRIX source trees.  Besides the NFS kernel
    code and the HIPPI, ATM, and FDDI drivers and firmware, check cmd/rcp and
    cmd/rsh.)  UDP page flipping is trivial on protocols that have no trailers.
    It requires trivial smarts in the NIC and much simpler buffer allocation
    by the NIC than RDMA requires.  (I suspect RDMA needs pools of buffers
    for every stream, while the classic tactic needs only two pools, "little"
    and "pages"....well, for tiny improvements I've also done it with "little",
    "medium" and "pages".)
    
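    To make the two-pool idea concrete, here is a toy Python model of one way
    a receiver could place frames using only "little" and "pages" pools.  It
    is a sketch under my own assumptions, not the IRIX code: the pool sizes,
    the class name TwoPoolReceiver, and a header length supplied by the
    caller are all hypothetical, and pool exhaustion is not handled.

```python
PAGE_SIZE = 4096    # assumed page size
LITTLE_SIZE = 256   # assumed size of a "little" buffer


class TwoPoolReceiver:
    """Toy model of a NIC that fills buffers from two pools only."""

    def __init__(self, n_little: int, n_pages: int):
        self.little = [bytearray(LITTLE_SIZE) for _ in range(n_little)]
        self.pages = [bytearray(PAGE_SIZE) for _ in range(n_pages)]

    def receive(self, frame: bytes, header_len: int):
        """Place one frame; return (header_buf, header_used, pages).

        pages is a list of (page_buf, bytes_used).  Payload always starts
        at offset 0 of a page, so a full page could be flipped into a
        buffer cache instead of being copied.
        """
        if len(frame) <= LITTLE_SIZE:
            # Small frame: goes entirely into a little buffer.
            buf = self.little.pop()
            buf[: len(frame)] = frame
            return buf, len(frame), []
        if header_len > LITTLE_SIZE:
            raise ValueError("header does not fit in a little buffer")
        hdr = self.little.pop()
        hdr[:header_len] = frame[:header_len]
        payload = frame[header_len:]
        pages = []
        for off in range(0, len(payload), PAGE_SIZE):
            chunk = payload[off : off + PAGE_SIZE]
            page = self.pages.pop()
            page[: len(chunk)] = chunk
            pages.append((page, len(chunk)))
        return hdr, header_len, pages
```

    In this model a frame with a 100-byte header and 8192 bytes of data
    consumes one little buffer and two full pages, and nothing in it depends
    on which stream the frame belongs to.
    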
    ] Realistically, who is going to be running a storage system that requires so 
    ] much bandwidth that avoiding receive copies is necessary, and runs on generic 
    ] NICs with no firmware/ASIC modifications possible?  So I think using modified 
    ] hardware is completely reasonable in those circumstances.
    ] ...
    
    Even more reasonable than special hardware are modified APIs and protocols
    and other steps, including ensuring that out-of-order packets are very
    rare, and that header offsets are few, fixed, known, and friendly.
    
    How would you have out-of-order arrival on a storage network, other than
    due to bit rot in the wires, and what storage network is going to have
    significant bit rot?
    
    
    Vernon Schryver    vjs@rhyolite.com
    



Last updated: Tue Sep 04 01:08:17 2001