From: "Benny Halevy"
X-Sender: bhalevy@panasas.com
X-Apparently-To: pnfs-reqs@yahoogroups.com
Date: Tue, 16 Dec 2003 02:01:20 -0500
Subject: FW: NEPS-REQS: getting started

-----Original Message-----
From: Garth Gibson [mailto:garth@panasas.com]
Sent: Wednesday, December 10, 2003 22:27
To: Craig Everhart; John Muth; Brian Pawlowski; David Pease; Julian Satran; Spencer Shepler; Gary Grider; Brent Welch; Benny Halevy; Jon Haswell; Dean Hildebrand; Peter Honeyman; Jim Carlson; Garth Gibson; Andy Adamson; Tyce McLarty; Peter Corbett; David Black
Cc: Garth Gibson
Subject: NEPS-REQS: getting started

So we are the requirements/problem statement subgroup of the NFS extension for parallel storage effort. Our job is to create the paper trail justification for adding something to NFS and provide a conceptual framework by which to identify possible solutions.

In the beginning this document is used to justify in the IETF process that there are problems that people take seriously that cannot be handled well in the scope of NFS today and that should be.

I asked around for examples to help us construct this document and I was pointed at the problem statement used to start the RDMA over IP effort (attached below). I was told that this was a particularly well done problem statement, and that we should not necessarily work this hard before giving the IETF something to look at.

ftp://ftp.rfc-editor.org/in-notes/internet-drafts/draft-ietf-rddp-problem-statement-02.txt

RDDP Abstract:

   This draft addresses an IP-based solution to the problem of high system costs due to network I/O copying in end-hosts at high speeds. The problem is due to the high cost of memory bandwidth, and it can be substantially improved using "copy avoidance." The high overhead has limited the use of TCP/IP in interconnection networks especially where high bandwidth, low latency and/or low overhead of end-system data movement are required by the hosted application.

So I suppose we could start with

pNFS Abstract:

   This draft addresses an NFS-based solution to the problem of high system costs due to store-and-forward copying of storage data from storage devices through a file server mount point to high-speed end-hosts that also have connectivity to source storage devices. The problem is due to the high cost of funneling large storage bandwidths through NFS on single IP addresses, and it can be substantially improved using "out-of-band access." The high cost of high-bandwidth NFS servers has limited the use of NFS in data centers especially where high storage bandwidths are required and numerous storage serving devices are already networked together.

A pNFS table of contents might be:

   1. Introduction
   2. The high cost of high bandwidth storage through NFS
   2.1 Out-of-band access decreases bandwidth requirements in central file servers
   3. Application level routing of storage data packets is the root cause of the problem
   4. Storage bandwidth bottlenecks are problematic for many key file system applications
   5. Out-of-band access techniques
   5.1 A conceptual framework: pNFS delegated maps for distributing files over SBC, OSD and NFS storage subsystems
   6. Security considerations
   7. Acknowledgements
   8. Informative references

Please have a look at the RDDP problem statement draft and comment on my simplistic strategy of monkey-see-monkey-do :-)

garth

Begin forwarded message:

> From: Garth Gibson
> Date: Wed Dec 10, 2003 9:34:58 PM Canada/Eastern
> To: Andy Adamson, David Black, Don Cameron, Jim Carlson, Peter Corbett, Craig Everhart, Steve Fridella, Garth Gibson, Gary Grider, Benny Halevy, Jon Haswell, Dean Hildebrand, Peter Honeyman, Xiaoye Jiang, Mike Kazar, Tyce McLarty, John Muth, Dave Noveck, Brian Pawlowski, David Pease, Julian Satran, Spencer Shepler, Brent Welch
> Subject: NFS Extensions for Parallel Storage, subgroup membership
>
> Folks,
>
> Thanks for a great workshop last Thursday!
>
> Materials presented that day are online:
> http://www.citi.umich.edu/NEPS/agenda.html
>
> Below are the workshop followup subgroup memberships as they are now. I think I heard Peter say that he would construct auto-managed email lists, which from the additions I've received this week, I have already decided would be great. Please Peter! Names like neps-all, neps-reqs, neps-ops, neps-sbc, neps-osd, neps-nfs would be great.
>
> Our goals, to reprise, are to sketch a set of requirements for NFS Extensions for Parallel Storage, or pNFS extensions, sketch a set of NFS operation extensions (possibly including alternatives), sketch a set of metadata definitions (possibly including alternatives) for out-of-band data access over fixed block (SBC) SCSI protocols, object (OSD) SCSI protocols and file (NFS) ONCRPC protocols.
>
> We want to do this quickly, over the next few months, and to take it into the IETF NFS process as a set of suggestions and strawman protocols. The current plan is that at that point those of us that follow through with this will do it in the IETF NFS working group. In order to convince the IETF and the NFS working group that we have important, useful and viable ideas, we are taking a little time to pull together starting material.
>
> The timelines discussed at the end of the workshop "heir of the dog" session were:
> - get workshop notes put together and out in December (Peter and Garth)
> - 0th draft of a requirements/problem statement internet draft by mid January
> - IETF submission of an internet draft by first week of Feb, so it can be part of the March IETF meeting and used as evidence for inclusion of extensions for parallel storage into the NFS working group charter
> - one or more documents (not necessarily fully agreeing) from each subgroup into the IETF NFS email discussion for early to mid March
> - a face-to-face followup workshop, open to the IETF NFS group at the FAST 2004 conference, in San Francisco Mar 31 - Apr 2, at which all further plans are proposed, argued and ratified (e.g. shall we be absorbed into the IETF NFS group)
>
> To help move this along, we have asked one person in each subgroup to push, prod and pull ideas and words out of us. Please help these sacrificial volunteers by contributing text, criticizing constructively with alternative text, and finding the time to read materials.
>
> These are volunteers in an unofficial process. We have no rules to be applied by arbitration, no membership to take votes from. If this consensus process, or these people, are not working out, then I suggest grass roots alternatives be suggested and explored as a group.
> Let's not get bogged down in process this early :-)
>
> But there are always going to be logistical and procedural issues that we need to deal with as a group. The suggestion at the workshop was that these multi-subgroup issues be taken into the requirements group. For example, I suggest that "scope" issues -- what we include and what we exclude from our agenda -- be dealt with in the requirements group, where we would need to add/delete requirements for each distinct aspect of our scope.
>
> I'm sure I'm way over the line giving this much direction :-) so I'll leave it to the subgroups to decide mechanisms for progress. For example, weekly conference calls, document exchange formats, editorship delegation and/or rotation, agreement achieving processes, ....
>
> And with that I'll go off and get to work on suggesting what our problem statement needs to say.
>
> garth
> 412-805-9878 (cell)
>
> -------------------------------------------------------
>
> pNFS requirements: Garth Gibson
> -----------------
> Andy Adamson
> David Black
> Jim Carlson
> Peter Corbett
> Craig Everhart
> Garth Gibson
> Gary Grider
> Benny Halevy
> Jon Haswell
> Dean Hildebrand
> Peter Honeyman
> Tyce McLarty
> John Muth
> Brian Pawlowski
> David Pease
> Julian Satran
> Spencer Shepler
> Brent Welch

Internet-Draft
Expires: December 2003

Allyn Romanow (Cisco)
Jeff Mogul (HP)
Tom Talpey (NetApp)
Stephen Bailey (Sandburst)

RDMA over IP Problem Statement
draft-ietf-rddp-problem-statement-02

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.
It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

This draft addresses an IP-based solution to the problem of high system costs due to network I/O copying in end-hosts at high speeds. The problem is due to the high cost of memory bandwidth, and it can be substantially improved using "copy avoidance." The high overhead has limited the use of TCP/IP in interconnection networks especially where high bandwidth, low latency and/or low overhead of end-system data movement are required by the hosted application.

Romanow, et al           Expires December 2003           [Page 1]
Internet-Draft       RDMA Over IP Problem Statement      June 2003

Table Of Contents

   1.   Introduction
   2.   The high cost of data movement operations in network I/O
   2.1. Copy avoidance improves processing overhead
   3.   Memory bandwidth is the root cause of the problem
   4.   High copy overhead is problematic for many key Internet applications
   5.   Copy Avoidance Techniques
   5.1. A Conceptual Framework: DDP and RDMA
   6.   Security Considerations
   7.   Acknowledgements
        Informative References
        Authors' Addresses
        Full Copyright Statement

1. Introduction

This draft considers the problem of high host processing overhead associated with network I/O that occurs under high speed conditions.
This problem is often referred to as the "I/O bottleneck" [CT90]. More specifically, the source of high overhead that is of interest here is data movement operations - copying. This issue is not to be confused with TCP offload, which is not addressed here. High speed refers to conditions where the network link speed is high relative to the bandwidths of the host CPU and memory. With today's computer systems, one Gbits/s and over is considered high speed.

High costs associated with copying are an issue primarily for large scale systems. Although smaller systems such as rack-mounted PCs and small workstations would benefit from a reduction in copying overhead, the benefit to smaller machines will be primarily in the next few years as they scale in the amount of bandwidth they handle. Today it is large system machines with high bandwidth feeds, usually multiprocessors and clusters, that are adversely affected by copying overhead. Examples of such machines include all varieties of servers: database servers, storage servers, application servers for transaction processing, for e-commerce, and web serving, content distribution, video distribution, backups, data mining and decision support, and scientific computing.

Note that such servers almost exclusively service many concurrent sessions (transport connections), which, in aggregate, are responsible for > 1 Gbits/s of communication. Nonetheless, the cost of copying overhead for a particular load is the same whether from few or many sessions.

The I/O bottleneck, and the role of data movement operations, have been widely studied in research and industry over the last approximately 14 years, and we draw freely on these results.
Historically, the I/O bottleneck has received attention whenever new networking technology has substantially increased line rates - 100 Mbits/s FDDI and Fast Ethernet, 155 Mbits/s ATM, 1 Gbits/s Ethernet. In earlier speed transitions, the availability of memory bandwidth allowed the I/O bottleneck issue to be deferred. Now however, this is no longer the case. While the I/O problem is significant at 1 Gbits/s, it is the introduction of 10 Gbits/s Ethernet which is motivating an upsurge of activity in industry and research [DAFS, IB, VI, CGZ01, Ma02, MAF+02].

Because of high overhead of end-host processing in current implementations, the TCP/IP protocol stack is not used for high speed transfer. Instead, special purpose network fabrics, using a technology generally known as remote direct memory access (RDMA), have been developed and are widely used. RDMA is a set of mechanisms that allow the network adapter, under control of the application, to steer data directly into and out of application buffers. Examples of such interconnection fabrics include Fibre Channel [FIBRE] for block storage transfer, Virtual Interface Architecture [VI] for database clusters, Infiniband [IB], Compaq Servernet [SRVNET], Quadrics [QUAD] for System Area Networks. These link level technologies limit application scaling in both distance and size, meaning that the number of nodes cannot be arbitrarily large.

This problem statement substantiates the claim that in network I/O processing, high overhead results from data movement operations, specifically copying; and that copy avoidance significantly decreases the processing overhead. It describes when and why the high processing overheads occur, explains why the overhead is problematic, and points out which applications are most affected.

In addition, this document introduces an architectural approach to solving the problem, which is developed in detail in [BT02].
It also discusses how the proposed technology may introduce security concerns and how they should be addressed.

2. The high cost of data movement operations in network I/O

A wealth of data from research and industry shows that copying is responsible for substantial amounts of processing overhead. It further shows that even in carefully implemented systems, eliminating copies significantly reduces the overhead, as referenced below.

Clark et al. [CJRS89] in 1989 show that TCP [Po81] overhead processing is attributable both to operating system costs such as interrupts, context switches, process management, buffer management, timer management, and to the costs associated with processing individual bytes, specifically computing the checksum and moving data in memory. They found moving data in memory is the more important of the costs, and their experiments show that memory bandwidth is the greatest source of limitation. In the data presented [CJRS89], 64% of the measured microsecond overhead was attributable to data touching operations, and 48% was accounted for by copying. The system measured Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet packets.

In a well-implemented system, copying can occur between the network interface and the kernel, and between the kernel and application buffers - two copies, each of which is two memory bus crossings - for read and write. Although in certain circumstances it is possible to do better, usually two copies are required on receive.

Subsequent work has consistently shown the same phenomenon as the earlier Clark study. A number of studies report results that data-touching operations, checksumming and data movement, dominate the processing costs for messages longer than 128 Bytes [BS96, CGY01, Ch96, CJRS89, DAPP93, KP96]. For smaller sized messages, per-packet overheads dominate [KP96, CGY01].

The percentage of overhead due to data-touching operations increases with packet size, since time spent on per-byte operations scales linearly with message size [KP96]. For example, Chu [Ch96] reported substantial per-byte latency costs as a percentage of total networking software costs for an MTU size packet on a SPARCstation/20 running memory-to-memory TCP tests over networks with 3 different MTU sizes. The percentages of total software costs attributable to per-byte operations were:

      1500 Byte Ethernet    18-25%
      4352 Byte FDDI        35-50%
      9180 Byte ATM         55-65%

Although many studies report results for data-touching operations including checksumming and data movement together, much work has focused just on copying [BS96, B99, Ch96, TK95]. For example, [KP96] reports results that separate processing times for checksum from data movement operations. For the 1500 Byte Ethernet size, 20% of total processing overhead time is attributable to copying. The study used 2 DECstations 5000/200 connected by an FDDI network. (In this study checksum accounts for 30% of the processing time.)

2.1. Copy avoidance improves processing overhead

A number of studies show that eliminating copies substantially reduces overhead. For example, results from copy-avoidance in the IO-Lite system [PDZ99], which aimed at improving web server performance, show a throughput increase of 43% over an optimized web server, and 137% improvement over an Apache server. The system was implemented in a 4.4BSD derived UNIX kernel, and the experiments used a server system based on a 333MHz Pentium II PC connected to a switched 100 Mbits/s Fast Ethernet.

There are many other examples where elimination of copying using a variety of different approaches showed significant improvement in system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97].
We will discuss the results of one of these studies in detail in order to clarify the significant degree of improvement produced by copy avoidance [Ch02].

Recent work by Chase et al. [CGY01], measuring CPU utilization, shows that avoiding copies reduces CPU time spent on data access from 24% to 15% at 370 Mbits/s for a 32 KBytes MTU using an AlphaStation XP1000 and a Myrinet adapter [BCF+95]. This is an absolute improvement of 9% due to copy avoidance.

The total CPU utilization was 35%, with data access accounting for 24%. Thus the relative importance of reducing copies is 26% (9 of the 35 percentage points of total utilization). At 370 Mbits/s, the system is not very heavily loaded. The relative improvement in achievable bandwidth is 34% (the same load would cost 26% of the CPU instead of 35%, and 35/26 is roughly 1.346). This is the improvement we would see if copy avoidance were added when the machine was saturated by network I/O.

Note that improvement from the optimization becomes more important if the overhead it targets is a larger share of the total cost. This is what happens if other sources of overhead, such as checksumming, are eliminated. In [CGY01], after removing checksum overhead, copy avoidance reduces CPU utilization from 26% to 10%. This is a 16% absolute reduction, a 61% relative reduction (16/26), and a 160% relative improvement in achievable bandwidth (26/10 = 2.6).

In fact, today's network interface hardware commonly offloads the checksum, which removes the other source of per-byte overhead. Such interfaces also coalesce interrupts to reduce per-packet costs. Thus, today copying costs account for a relatively larger part of CPU utilization than previously, and therefore relatively more benefit is to be gained in reducing them. (Of course this argument would be specious if the amount of overhead were insignificant, but it has been shown to be substantial.)

3. Memory bandwidth is the root cause of the problem

Data movement operations are expensive because memory bandwidth is scarce relative to network bandwidth and CPU bandwidth [PAC+97]. This trend existed in the past and is expected to continue into the future [HP97, STREAM], especially in large multiprocessor systems.

With copies crossing the bus twice per copy, network processing overhead is high whenever network bandwidth is large in comparison to CPU and memory bandwidths. Generally with today's end-systems, the effects are observable at network speeds over 1 Gbits/s.

A common question is whether an increase in CPU processing power alleviates the problem of high processing costs of network I/O. The answer is no; it is the memory bandwidth that is the issue. Faster CPUs do not help if the CPU spends most of its time waiting for memory [CGY01].

The widening gap between microprocessor performance and memory performance has long been a widely recognized and well-understood problem [PAC+97]. Hennessy [HP97] shows microprocessor performance grew from 1980-1998 at 60% per year, while the access time to DRAM improved at 10% per year, giving rise to an increasing "processor-memory performance gap".

Another source of relevant data is the STREAM Benchmark Reference Information website which provides information on the STREAM benchmark [STREAM]. The benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MBytes/s) and the corresponding computation rate for simple vector kernels measured in MFLOPS. The website tracks information on sustainable memory bandwidth for hundreds of machines and all major vendors.

Results show measured system performance statistics. Processing performance from 1985-2001 increased at 50% per year on average, and sustainable memory bandwidth from 1975 to 2001 increased at 35% per year on average over all the systems measured.
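
To make the compounding effect of these growth rates concrete, here is a minimal sketch that is not part of the original draft. It projects how far a faster-growing quantity pulls ahead of a slower one under compound annual growth. The function name gap_after and the 10-year horizon are illustrative assumptions; only the 60%/10% and 50%/35% annual rates come from the text.

```python
def gap_after(years, fast_rate, slow_rate):
    """Ratio by which the faster-growing quantity outpaces the slower one
    after `years` of compound annual growth (rates given as fractions)."""
    return ((1.0 + fast_rate) / (1.0 + slow_rate)) ** years


# Annual rates quoted in the text; the 10-year horizon is an assumption
# chosen only for illustration.
cpu_vs_dram = gap_after(10, 0.60, 0.10)  # CPU performance vs. DRAM access time [HP97]
proc_vs_mem = gap_after(10, 0.50, 0.35)  # processing vs. memory bandwidth [STREAM]

print(f"CPU vs. DRAM access time after 10 years: {cpu_vs_dram:.1f}x")
print(f"Processing vs. memory bandwidth after 10 years: {proc_vs_mem:.1f}x")
```

Under these assumptions even the smaller 15-point difference in annual rates compounds to a gap of nearly a factor of three per decade.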
A similar 15% per year lead of processing bandwidth over memory bandwidth shows up in another statistic, machine balance [Mc95], a measure of the relative rate of CPU to memory bandwidth (FLOPS/cycle) / (sustained memory ops/cycle) [STREAM].

Network bandwidth has been increasing about 10-fold roughly every 8 years, which is a 40% per year growth rate.

A typical example illustrates that the memory bandwidth compares unfavorably with link speed. The STREAM benchmark shows that a modern uniprocessor PC, for example the 1.2 GHz Athlon in 2001, will move the data 3 times in doing a receive operation - 1 for the network interface to deposit the data in memory, and 2 for the CPU to copy the data. With 1 GBytes/s of memory bandwidth, meaning one read or one write, the machine could handle approximately 2.67 Gbits/s of network bandwidth, one third the copy bandwidth (1 GBytes/s is 8 Gbits/s, and 8/3 is about 2.67). But this assumes 100% utilization, which is not possible, and more importantly the machine would be totally consumed! (A rule of thumb for databases is that 20% of the machine should be required to service I/O, leaving 80% for the database application. And, the less the better.)

In 2001, 1 Gbits/s links were common. An application server may typically have two 1 Gbits/s connections - one connection backend to a storage server and one front-end, say for serving HTTP [FGM+99]. Thus the communications could use 2 Gbits/s. In our typical example, the machine could handle 2.7 Gbits/s at its theoretical maximum while doing nothing else. This means that the machine basically could not keep up with the communication demands in 2001; with the relative growth trends, the situation only gets worse.

4. High copy overhead is problematic for many key Internet applications

If a significant portion of resources on an application machine is consumed in network I/O rather than in application processing, it becomes difficult for the application to scale - to handle more clients, to offer more services.

Several years ago the most affected applications were streaming multimedia, parallel file systems and supercomputing on clusters [BS96]. In addition, today the applications that suffer from copying overhead are more central in Internet computing - they store, manage, and distribute the information of the Internet and the enterprise. They include database applications doing transaction processing, e-commerce, web serving, decision support, content distribution, video distribution, and backups. Clusters are typically used for this category of application, since they have advantages of availability and scalability.

Today these applications, which provide and manage Internet and corporate information, are typically run in data centers that are organized into three logical tiers. One tier is typically a set of web servers connecting to the WAN. The second tier is a set of application servers that run the specific applications usually on more powerful machines, and the third tier is backend databases. Physically, the first two tiers - web server and application server - are usually combined [Pi01]. For example an e-commerce server communicates with a database server and with a customer site, or a content distribution server connects to a server farm, or an OLTP server connects to a database and a customer site.

When network I/O uses too much memory bandwidth, performance on network paths between tiers can suffer. (There might also be performance issues on SAN paths used either by the database tier or the application tier.)
The high overhead from network-related memory copies diverts system resources from other application processing. It also can create bottlenecks that limit total system performance.

There is a large and growing number of these application servers distributed throughout the Internet. In 1999 approximately 3.4 million server units were shipped, in 2000, 3.9 million units, and the estimated annual growth rate for 2000-2004 was 17 percent [Ne00, Pa01].

There is high motivation to maximize the processing capacity of each CPU, as scaling by adding CPUs one way or another has drawbacks. For example, adding CPUs to a multiprocessor will not necessarily help, as a multiprocessor improves performance only when the memory bus has additional bandwidth to spare. Clustering can add additional complexity to handling the applications.

In order to scale a cluster or multiprocessor system, one must proportionately scale the interconnect bandwidth. Interconnect bandwidth governs the performance of communication-intensive parallel applications; if this (often expressed in terms of "bisection bandwidth") is too low, adding additional processors cannot improve system throughput. Interconnect latency can also limit the performance of applications that frequently share data between processors.

So, excessive overheads on network paths in a "scalable" system both can require the use of more processors than optimal, and can reduce the marginal utility of those additional processors.

Copy avoidance scales a machine upwards by removing at least two-thirds of the bus bandwidth load from the "very best" 1-copy (on receive) implementations, and removes at least 80% of the bandwidth overhead from the 2-copy implementations.

An example showing poor performance with copies and improved scaling with copy avoidance is illustrative. The IO-Lite work [PDZ99] shows higher server throughput servicing more clients using a zero-copy system.
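
The two-thirds and 80% figures above appear to follow from counting memory bus crossings per received byte, using the accounting given earlier in the draft: one crossing for the network interface to deposit the data, plus two crossings (a read and a write) for each CPU copy. The sketch below is not part of the original draft, and its function names are illustrative assumptions. It reproduces that arithmetic, along with the 2.67 Gbits/s ceiling from Section 3.

```python
from fractions import Fraction


def bus_crossings(copies):
    """Memory bus crossings per received byte: one for the NIC to deposit
    the data, plus a read and a write for each CPU copy."""
    return 1 + 2 * copies


def load_removed(copies):
    """Fraction of bus crossings eliminated by going from `copies`
    CPU copies down to zero copies."""
    before = bus_crossings(copies)
    return Fraction(before - bus_crossings(0), before)


def max_network_gbits(mem_gbytes_per_s, copies):
    """Upper bound on receive bandwidth (Gbits/s) if every bus crossing
    consumes memory bandwidth and nothing else uses the memory system."""
    return mem_gbytes_per_s * 8 / bus_crossings(copies)


print(load_removed(1))            # 1-copy receive path
print(load_removed(2))            # 2-copy receive path
print(max_network_gbits(1.0, 1))  # 1 GBytes/s memory, 1-copy receive
```

The model deliberately ignores caching effects and any other memory traffic, which is presumably why the draft states the savings as "at least" these fractions.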
In an experiment designed to mimic real world web conditions by simulating the effect of TCP WAN connections on the server, the performance of 3 servers was compared. One server was Apache, another an optimized server called Flash, and the third the Flash server running IO-Lite, called Flash-Lite with zero copy. The measurement was of throughput in requests/second as a function of the number of slow background clients that could be served. As the table shows, Flash-Lite has better throughput, especially as the number of clients increases.

                  Apache          Flash       Flash-Lite
                  ------          -----       ----------
   #Clients   Thruput reqs/s     Thruput       Thruput

        0          520             610           890
       16          390             490           890
       32          360             490           850
       64          360             490           890
      128          310             450           880
      256          310             440           820

Traditional Web servers (which mostly send data and can keep most of their content in the file cache) are not the worst case for copy overhead. Web proxies (which often receive as much data as they send) and complex Web servers based on SANs or multi-tier systems will suffer more from copy overheads than in the example above.

5. Copy Avoidance Techniques

There has been extensive research investigation and industry experience with two main alternative approaches to eliminating data movement overhead, often along with improving other Operating System processing costs. In one approach, hardware and/or software changes within a single host reduce processing costs. In another approach, memory-to-memory networking [MAF+02], hosts communicate via information that allows them to reduce processing costs.

The single host approaches range from new hardware and software architectures [KSZ95, Wa97, DWB+93] to new or modified software systems [BP96, Ch96, TK95, DP93, PDZ99].
In the approach based on using a networking protocol to exchange information, the network adapter, under control of the application, places data directly into and out of application buffers, reducing the need for data movement. Commonly this approach is called RDMA, Remote Direct Memory Access.

As discussed below, research and industry experience has shown that copy avoidance techniques within the receiver processing path alone have proven to be problematic. The research special purpose host adapter systems had good performance and can be seen as precursors for the commercial RDMA-based NICs [KSZ95, DWB+93]. In software, many implementations have successfully achieved zero-copy transmit, but few have accomplished zero-copy receive. And those that have done so make strict alignment and no-touch requirements on the application, greatly reducing the portability and usefulness of the implementation.

In contrast, experience has proven satisfactory with memory-to-memory systems that permit RDMA - performance has been good and there have not been system or networking difficulties. RDMA is a single solution. Once implemented, it can be used with any OS and machine architecture, and it does not need to be revised when either of these changes.

In early work, one goal of the software approaches was to show that TCP could go faster with appropriate OS support [CJRS89, CFF+94]. While this goal was achieved, further investigation and experience showed that, though it is possible to craft software solutions, specific system optimizations have been complex, fragile, extremely interdependent with other system parameters in complex ways, and often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93, KSZ95, PDZ99]. The network I/O system interacts with other aspects of the Operating System such as machine architecture, file I/O, and disk I/O [Br99, Ch96, DP93].
For example, the Solaris Zero-Copy TCP work [Ch96], which relies on page remapping, shows that the results are highly interdependent with other systems, such as the file system, and that the particular optimizations are specific for particular architectures, meaning for each variation in architecture optimizations must be re-crafted [Ch96].

A number of research projects and industry products have been based on the memory-to-memory approach to copy avoidance. These include U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], Infiniband [IB], and Winsock Direct [Pi01]. Several memory-to-memory systems have been widely used and have generally been found to be robust, to have good performance, and to be relatively simple to implement. These include VI [VI], Myrinet [BCF+95], Quadrics [QUAD], and Compaq/Tandem Servernet [SRVNET]. Networks based on these memory-to-memory architectures have been used widely in scientific applications and in data centers for block storage, file system access, and transaction processing.

By exporting direct memory access "across the wire", applications may direct the network stack to manage all data directly from application buffers. A large and growing class of applications has already emerged which takes advantage of such capabilities, including all the major databases, as well as file systems such as DAFS [DAFS] and network protocols such as Sockets Direct [SDP].

5.1. A Conceptual Framework: DDP and RDMA

An RDMA solution can be usefully viewed as being comprised of two distinct components: "direct data placement (DDP)" and "remote direct memory access (RDMA) semantics". They are distinct in purpose and also in practice - they may be implemented as separate protocols.

The more fundamental of the two is the direct data placement facility.
This is the means by which memory is exposed to the remote peer in an appropriate fashion, and the means by which the peer may access it, for instance reading and writing.

The RDMA control functions are semantically layered atop direct data placement. Included are operations that provide "control" features, such as connection and termination, and the ordering of operations and signaling their completions. A "send" facility is provided.

While the functions (and potentially protocols) are distinct, historically both aspects taken together have been referred to as "RDMA". The facilities of direct data placement are useful in and of themselves, and may be employed by other upper layer protocols to facilitate data transfer. Therefore, it is often useful to refer to DDP as the data placement functionality and RDMA as the control aspect.

[BT02] develops an architecture for DDP and RDMA, and is a companion draft to this problem statement.

6. Security Considerations

Solutions to the problem of reducing copying overhead in high bandwidth transfers via one or more protocols may introduce new security concerns. Any proposed solution must be analyzed for security threats and any such threats addressed. Potential security weaknesses need to be examined and described, and an adequate solution to them found. These include resource issues that might lead to denial-of-service attacks, overwrites and other concurrent operations, the ordering of completions as required by the RDMA protocol, the granularity of transfer, and any other identified threats.

Layered atop Internet transport protocols, the RDMA protocols will gain leverage from and must permit integration with Internet security standards, such as IPSec and TLS [IPSEC, TLS]. A thorough analysis of the degree to which these protocols address potential threats is required.
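To make the Section 5.1 split a little more concrete, and to show the kind of per-operation check a security analysis would have to cover, here is a minimal sketch of a data placement endpoint. It is an illustration only, not any protocol's actual design: `LocalMemory`, `Access`, `Region`, `PlacementError` and the integer tags are all hypothetical names, and the assumption is simply that the application registers each region explicitly and that every peer read or write is checked against that registration.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Access(Flag):
    """Rights the local application grants a peer on one region."""
    NONE = 0
    READ = auto()
    WRITE = auto()


class PlacementError(Exception):
    """Raised when a peer operation is not covered by a registration."""


@dataclass
class Region:
    buffer: bytearray
    access: Access


class LocalMemory:
    """Hypothetical endpoint: only registered regions are reachable."""

    def __init__(self):
        self._regions = {}
        self._next_tag = 1

    def register(self, size, access):
        """Expose a new zero-filled region; return the tag a peer must use."""
        tag = self._next_tag
        self._next_tag += 1
        self._regions[tag] = Region(bytearray(size), access)
        return tag

    def revoke(self, tag):
        """Withdraw consent; later peer operations on the tag fail."""
        self._regions.pop(tag, None)

    def _check(self, tag, offset, length, needed):
        region = self._regions.get(tag)
        if region is None:
            raise PlacementError("unknown or revoked region")
        if not (region.access & needed):
            raise PlacementError("operation not permitted on this region")
        if offset < 0 or length < 0 or offset + length > len(region.buffer):
            raise PlacementError("access outside region bounds")
        return region

    def peer_write(self, tag, offset, data):
        """Place peer data directly into the registered buffer."""
        region = self._check(tag, offset, len(data), Access.WRITE)
        region.buffer[offset:offset + len(data)] = data

    def peer_read(self, tag, offset, length):
        """Return a copy of the requested bytes to the peer."""
        region = self._check(tag, offset, length, Access.READ)
        return bytes(region.buffer[offset:offset + length])
```

In this sketch the placement functions (`peer_write`, `peer_read`) play the DDP role, while anything like connection setup, ordering or completion signaling would sit in a separate control layer that is deliberately left out.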
Security for an RDMA design requires more than just securing the communication channel. While it is necessary to be able to guarantee channel properties such as privacy, integrity, and authentication, these properties cannot defend against all attacks from properly authenticated peers, which might be malicious, compromised, or buggy. For example, an RDMA peer should not be able to read or write memory regions without prior consent. Further, it must not be possible to evade consistency checks at the recipient. The RDMA design must allow the recipient to rely on its consistent memory contents by controlling peer access to memory regions explicitly, and must disallow peer access to regions when not authorized. The RDMA protocols must ensure that regions addressable by RDMA peers be under strict application control. Remote access to local memory by a network peer introduces a number of potential security concerns. This becomes particularly important in the Internet context, where such access can be exported globally. The RDMA protocols carry in part what is essentially user information, explicitly including addressing information and operation type (read or write), and implicitly including protection and attributes. As such, the protocol requires checking of these higher level aspects in addition to the basic formation of messages. The semantics associated with each class of error must be clearly defined, and the expected action to be taken on mismatch be specified. In some cases, this will result in a catastrophic error on the RDMA association, however in others a local or remote error may be signalled. Certain of these errors may require consideration of abstract local semantics, which must be carefully specified so as to provide useful behavior while not constraining the implementation. 7. Acknowledgements Jeff Chase generously provided many useful insights and information. Thanks to Jim Pinkerton for many helpful discussions. 8. Informative References [BCF+95] N. 
J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su. "Myrinet - A gigabit-per-second local-area network", IEEE Micro, February 1995

[BJM+96] G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J. Wilkes, "An implementation of the Hamlyn send-managed interface architecture", in Proceedings of the Second Symposium on Operating Systems Design and Implementation, USENIX Assoc., October 1996

[BLA+94] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, "A virtual memory mapped network interface for the SHRIMP multicomputer", in Proceedings of the 21st Annual Symposium on Computer Architecture, April 1994, pp. 142-153

[Br99] J. C. Brustoloni, "Interoperation of copy avoidance in network and file I/O", Proceedings of IEEE Infocom, 1999, pp. 534-542

[BS96] J. C. Brustoloni, P. Steenkiste, "Effects of buffering semantics on I/O performance", Proceedings OSDI'96, USENIX, Seattle, WA October 1996, pp. 277-291

RFC Editor note: Replace following architecture draft-ietf- name, status and date with appropriate reference when assigned.

[BT02] S. Bailey, T. Talpey, "The Architecture of Direct Data Placement (DDP) And Remote Direct Memory Access (RDMA) On Internet Protocols", Internet Draft Work in Progress, draft-ietf-rddp-arch-02, June 2003

[CFF+94] C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A. Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde, "High-performance TCP/IP and UDP/IP networking in DEC OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE Symposium on High Performance Distributed Computing, August 1994, pp. 36-42

[CGY01] J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system optimizations for high-speed TCP", IEEE Communications Magazine, Volume: 39, Issue: 4, April 2001, pp 68-74.
http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}

[Ch96] H.K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX 1996 Annual Technical Conference, San Diego, CA, January 1996

[Ch02] Jeffrey Chase, Personal communication

[CJRS89] D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An analysis of TCP processing overhead", IEEE Communications Magazine, volume: 27, Issue: 6, June 1989, pp 23-29

[CT90] D. D. Clark, D. Tennenhouse, "Architectural considerations for a new generation of protocols", Proceedings of the ACM SIGCOMM Conference, 1990

[DAFS] DAFS Collaborative, "Direct Access File System Specification v1.0", September 2001, available from http://www.dafscollaborative.org

[DAPP93] P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson, "Network subsystem design", IEEE Network, July 1993, pp. 8-17

[DP93] P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth cross-domain transfer facility", Proceedings of the 14th ACM Symposium of Operating Systems Principles, December 1993

[DWB+93] C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, J. Lumley, "Afterburner: architectural support for high-performance protocols", Technical Report, HP Laboratories Bristol, HPL-93-46, July 1993

[EBBV95] T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A user-level network interface for parallel and distributed computing", Proc. of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain, Colorado, December 3-6, 1995

[FGM+99] R. Fielding, J. Gettys, J. Mogul, F. Frystyk, L. Masinter, P. Leach, T.
Berners-Lee, "Hypertext Transfer Protocol - HTTP/1.1", RFC 2616, June 1999

[FIBRE] ANSI Technical Committee T10, "Fibre Channel Protocol (FCP)" (and as revised and updated), ANSI X3.269:1996 [R2001], committee draft available from http://www.t10.org/drafts.htm#FibreChannel

[HP97] J. L. Hennessy, D. A. Patterson, Computer Organization and Design, 2nd Edition, San Francisco: Morgan Kaufmann Publishers, 1997

[IB] InfiniBand Trade Association, "InfiniBand Architecture Specification, Volumes 1 and 2", Release 1.1, November 2002, available from http://www.infinibandta.org/specs

[KP96] J. Kay, J. Pasquale, "Profiling and reducing processing overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol 4, No. 6, pp. 817-828, December 1996

[KSZ95] K. Kleinpaste, P. Steenkiste, B. Zill, "Software support for outboard buffering and checksumming", SIGCOMM'95

[Ma02] K. Magoutis, "Design and Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD", in Proceedings of USENIX BSDCon 2002 Conference, San Francisco, CA, February 11-14, 2002.

[MAF+02] K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. S. Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure and Performance of the Direct Access File System (DAFS)", accepted for publication at the 2002 USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002.

[Mc95] J. D. McCalpin, "A Survey of memory bandwidth and machine balance in current high performance computers", IEEE TCCA Newsletter, December 1995

[Ne00] A. Newman, "IDC report paints conflicted picture of server market circa 2004", ServerWatch, July 24, 2000 http://serverwatch.internet.com/news/2000_07_24_a.html

[Pa01] M.
Pastore, "Server shipments for 2000 surpass those in 1999", ServerWatch, February 7, 2001 http://serverwatch.internet.com/news/2001_02_07_a.html

[PAC+97] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A case for intelligent RAM: IRAM", IEEE Micro, April 1997

[PDZ99] V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O buffering and caching system", Proc. of the 3rd Symposium on Operating Systems Design and Implementation, New Orleans, LA, February 1999

[Pi01] J. Pinkerton, "Winsock Direct: The Value of System Area Networks", May 2001, available from http://www.microsoft.com/windows2000/techinfo/howitworks/communications/winsock.asp

[Po81] J. Postel, "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981

[QUAD] Quadrics Ltd., Quadrics QSNet product information, available from http://www.quadrics.com/website/pages/02qsn.html

[SDP] InfiniBand Trade Association, "Sockets Direct Protocol v1.0", Annex A of InfiniBand Architecture Specification Volume 1, Release 1.1, November 2002, available from http://www.infinibandta.org/specs

[SRVNET] R. Horst, "TNet: A reliable system area network", IEEE Micro, pp. 37-45, February 1995

[STREAM] J. D. McCalpin, The STREAM Benchmark Reference Information, http://www.cs.virginia.edu/stream/

[TK95] M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O framework for UNIX", Technical Report, SMLI TR-95-39, May 1995

[VI] Compaq Computer Corp., Intel Corporation and Microsoft Corporation, "Virtual Interface Architecture Specification Version 1.0", December 1997, available from http://www.vidf.org/info/04standards.html

[Wa97] J. R. Walsh, "DART: Fast application-level networking via data-copy avoidance", IEEE Network, July/August 1997, pp.
28-38

Authors' Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810 USA
Phone: +1 978 689 1614
Email: steph@sandburst.com

Jeffrey C. Mogul
Western Research Laboratory
Hewlett-Packard Company
1501 Page Mill Road, MS 1251
Palo Alto, CA 94304 USA
Phone: +1 650 857 2206 (email preferred)
Email: JeffMogul@acm.org

Allyn Romanow
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134 USA
Phone: +1 408 525 8836
Email: allyn@cisco.com

Tom Talpey
Network Appliance
375 Totten Pond Road
Waltham, MA 02451 USA
Phone: +1 781 768 5329
Email: thomas.talpey@netapp.com

Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Romanow, et al Expires December 2003 [Page 18] From bhalevy@panasas.com Mon Dec 15 23:02:52 2003 Return-Path: X-Sender: bhalevy@panasas.com X-Apparently-To: pnfs-reqs@yahoogroups.com Received: (qmail 63766 invoked from network); 16 Dec 2003 07:02:51 -0000 Received: from unknown (66.218.66.166) by m18.grp.scd.yahoo.com with QMQP; 16 Dec 2003 07:02:51 -0000 Received: from unknown (HELO PIKES.panasas.com) (65.194.124.178) by mta5.grp.scd.yahoo.com with SMTP; 16 Dec 2003 07:02:49 -0000 Received: from yang ([172.17.19.46]) by PIKES.panasas.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id SVSY1CWB; Tue, 16 Dec 2003 02:02:32 -0500 To: Date: Tue, 16 Dec 2003 02:03:00 -0500 Message-ID: MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_NextPart_000_0007_01C3C378.BB3217E0" X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.6604 (9.0.2911.0) Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165 X-eGroups-Remote-IP: 65.194.124.178 From: "Benny Halevy" Subject: FW: Re: NEPS-REQS: getting started X-Yahoo-Group-Post: member; u=169276676 X-Yahoo-Profile: benny_halevy -----Original Message----- From: Gary Grider [mailto:ggrider@lanl.gov] Sent: Saturday, December 13, 2003 00:02 To: Garth Gibson; Craig Everhart; John Muth; Brian Pawlowski; David Pease; Julian Satran; Spencer Shepler; Brent Welch; Benny Halevy; Jon Haswell; Dean Hildebrand; Peter Honeyman; Jim Carlson; Garth Gibson; Andy Adamson; Tyce McLarty; Peter Corbett; David Black Cc: Garth Gibson Subject: Re: NEPS-REQS: getting started I 
decided to toss out a very quick and dirty draft with a lot of parts missing. Nothing sacred, just thoughts as they occurred to me partially organized. I put it in Word so I could get formatting, TOC, etc. I am attaching a Word and PDF. I would be happy to put this on a web site for us if you want. I also would be happy to centralize the edits and re-post it on the web etc. Thanks Gary At 10:26 PM 12/10/2003 -0500, Garth Gibson wrote: >So we are the requirements/problem statement subgroup of the NFS >extension for parallel storage effort. > >Our job is to create the paper trail justification for adding something >to NFS and provide a conceptual framework by which to identify possible >solutions. > >In the beginning this document is used to justify in the IETF process >that there are problems that people take seriously that cannot be >handled well in the scope of NFS today and that should be. > >I asked around for examples to help us construct this document and I >was pointed at the problem statement used to start the RDMA over IP >effort (attached below). I was told that this was a particularly well >done problem statement, and that we should not necessarily work this >hard before giving the IETF something to look at. > >ftp://ftp.rfc-editor.org/in-notes/internet-drafts/draft-ietf-rddp- >problem-statement-02.txt > >RDDP Abstract: This draft addresses an IP-based solution to the problem >of high system costs due to network I/O copying in end-hosts at high >speeds. The problem is due to the high cost of memory bandwidth, and >it can be substantially improved using "copy avoidance." The high >overhead has limited the use of TCP/IP in interconnection networks >especially where high bandwidth, low latency and/or low overhead of >end-system data movement are required by the hosted application. 
> >So I suppose we could start with > >pNFS Abstract: This draft addresses an NFS-based solution to the >problem of high system costs due to store-and-forward copying of >storage data from storage devices through a file server mount point to >high-speed end-hosts that also have connectivity to source storage >devices. The problem is due to the high cost of funneling large >storage bandwidths through NFS on single IP addresses, and it can be >substantially improved using "out-of-band access." The high cost of >high-bandwidth NFS servers has limited the use of NFS in data centers >especially where high storage bandwidths are required and numerous >storage serving devices are already networked together. > >A pNFS table of contents might be: > >1. Introduction >2. The high cost of high bandwidth storage through NFS >2.1 Out-of-band access decreases bandwidth requirements in central file >servers >3. Application level routing of storage data packets is the root cause >of the problem >4. Storage bandwidth bottlenecks are problematic for many key file >system applications >5. Out-of-band access techniques >5.1 A conceptual framework: pNFS delegated maps for distributing files >over SBC, OSD and NFS storage subsystems >6. Security considerations >7. Acknowledgements >8. 
Informative references > >Please have a look at the RDDP problem statement draft and comment on >my simplistic strategy of monkey-see-monkey-do :-) > >garth > > > >Begin forwarded message: > >>From: Garth Gibson >>Date: Wed Dec 10, 2003 9:34:58 PM Canada/Eastern >>To: Andy Adamson , David Black >>, Don Cameron , Jim >>Carlson , Peter Corbett , Craig >>Everhart , Steve Fridella >>, Garth Gibson , >>Gary Grider , Benny Halevy , >>Jon Haswell , Dean Hildebrand >>, Peter Honeyman , >>Xiaoye Jiang , Mike Kazar , >>Tyce McLarty , John Muth , >>Dave Noveck , Brian Pawlowski >>, David Pease , >>Julian Satran , Spencer Shepler >>, Brent Welch >>Subject: NFS Extensions for Parallel Storage, subgroup membership >> >>Folks, >> >>Thanks for a great workshop last Thursday! >> >>Materials presented that day are online: >>http://www.citi.umich.edu/NEPS/agenda.html >> >>Below are the workshop followup subgroup memberships as they are now. >>I think I heard Peter say that he would construct auto-managed email >>lists, which from the additions I've received this week, I have >>already decided would be great. Please Peter! Names like neps-all, >>neps-reqs, neps-ops, neps-sbc, neps-osd, neps-nfs would be great. >> >>Our goals, to reprise, are to sketch a set of requirements for NFS >>Extensions for Parallel Storage, or pNFS extensions, sketch a set of >>NFS operation extensions (possibly including alternatives), sketch a >>set of metadata definitions (possibly including alternatives) for >>out-of-band data access over fixed block (SBC) SCSI protocols, object >>(OSD) SCSI protocols and file (NFS) ONCRPC protocols. >> >>We want to do this quickly, over the next few months, and to take it >>into the IETF NFS process as a set of suggestions and strawman >>protocols. The current plan is that at that point those of us that >>follow through with this will to it in the IETF NFS working group. 
In >>order to convince the IETF and the NFS working group that we have >>important, useful and viable ideas, we are taking a little time to >>pull together starting material. >> >>The timelines discussed at the end of the workshop "heir of the dog" >>session were: >>- get workshop notes put together and out in December (Peter and Garth) >>- 0th draft of a requirements/problem statement internet draft by mid >>January >>- IETF submission of an internet draft by first week of Feb, so it can >>be part of the March IETF meeting and used as evidence for inclusion >>of extensions for parallel storage into the NFS working group charter >>- one or more documents (not necessarily fully agreeing) from each >>subgroup into the IETF NFS email discussion for early to mid March >>- a face-to-face followup workshop, open to the IETF NFS group at the >>FAST 2004 conference, in San Francisco Mar 31 - Apr 2, at which all >>further plans are proposed, argued and ratified (e.g. shall we be >>absorbed into the IETF NFS group) >> >>To help move this along, we have asked one person in each subgroup to >>push, prod and pull ideas and words out of us. Please help these >>sacrificial volunteers with by contributing text, criticizing >>constructively with alternative text, and finding the time to read >>materials. >> >>These are volunteers in an unofficial process. We have no rules to be >>applied by arbitration, no membership to take votes from. If this >>consensus process, or these people, are not working out, then I >>suggest grass roots alternatives be suggested and explored as a group. >> Lets not get bogged down in process this early :-) >> >>But there are always going to be logistical and procedural issues that >>we need to deal with as a group. The suggestion at the workshop was >>that these multi-subgroup issues be taken into the requirements group. 
>> For example, I suggest that "scope" issues -- what we include and >>what we exclude from our agenda -- be dealt with in the requirements >>group, where we would need to add/delete requirements for each >>distinct aspect of our scope. >> >>I'm sure I'm way over the line giving this much direction :-) so I'll >>leave it to the subgroups to decide mechanisms for progress. For >>example, weekly conference calls, document exchange formats, >>editorship delegation and/or rotation, agreement achieving processes, >>.... >> >>And with that I'll go off and get to work on suggesting what our >>problem statement needs to say. >> >>garth >>412-805-9878 (cell) >> >>------------------------------------------------------- >> >>pNFS requirements: Garth Gibson >>----------------- >>Andy Adamson >>David Black >>Jim Carlson >>Peter Corbett >>Craig Everhart >>Garth Gibson >>Gary Grider >>Benny Halevy >>Jon Haswell >>Dean Hildebrand >>Peter Honeyman >>Tyce McLarty >>John Muth >>Brian Pawlowski >>David Pease >>Julian Satran >>Spencer Shepler >>Brent Welch > > Attachment (not stored) draft-ietf-pNFS-problem-statement.pdf Type: application/pdf Attachment (not stored) draft-ietf-pNFS-problem-statement.doc Type: application/msword From bhalevy@panasas.com Mon Dec 15 23:06:28 2003 Return-Path: X-Sender: bhalevy@panasas.com X-Apparently-To: pnfs-reqs@yahoogroups.com Received: (qmail 27808 invoked from network); 16 Dec 2003 07:06:28 -0000 Received: from unknown (66.218.66.216) by m10.grp.scd.yahoo.com with QMQP; 16 Dec 2003 07:06:28 -0000 Received: from unknown (HELO PIKES.panasas.com) (65.194.124.178) by mta1.grp.scd.yahoo.com with SMTP; 16 Dec 2003 07:06:28 -0000 Received: from yang ([172.17.19.46]) by PIKES.panasas.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id SVSY1CW2; Tue, 16 Dec 2003 02:05:59 -0500 To: Date: Tue, 16 Dec 2003 02:06:27 -0500 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit 
X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.6604 (9.0.2911.0) Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165 X-eGroups-Remote-IP: 65.194.124.178 From: "Benny Halevy" Subject: FW: Re: NEPS-REQS: getting started X-Yahoo-Group-Post: member; u=169276676 X-Yahoo-Profile: benny_halevy

-----Original Message-----
From: Tyce McLarty [mailto:mclarty3@llnl.gov]
Sent: Monday, December 15, 2003 13:49
To: Gary Grider; Garth Gibson; Craig Everhart; John Muth; Brian Pawlowski; David Pease; Julian Satran; Spencer Shepler; Brent Welch; Benny Halevy; Jon Haswell; Dean Hildebrand; Peter Honeyman; Jim Carlson; Garth Gibson; Andy Adamson; Peter Corbett; David Black
Cc: Garth Gibson
Subject: Re: NEPS-REQS: getting started

I've been wondering how important it is to cast the "problem" as one of cost, rather than as the ability to do things that cannot be done today with added benefits in cost reduction. I liked the list that Garth put up at the workshop:

Scalable bandwidth
Scalable capacity
Load balancing
Capacity balancing
plus the big winner - a standardized client.

So the Introduction would be basically two paragraphs with (in either order):

1. proposal to extend NFSv4 to allow parallel out-of-band client access to data separate from metadata operations.
2. why it's important to do using the reasons outlined above.

My question is - How close do we need to model the RDMA problem statement? Is cost the best/only justification or can we use new & needed capability plus value added? I think Gary has slanted his additions this direction, but it seems like we should all agree on some basic principles before we get too deep in word-smithing.

Thanks,
Tyce

At 10:02 PM 12/12/2003 -0700, Gary Grider wrote:
>I decided to toss out a very quick and dirty draft with a lot of parts
>missing.
>Nothing sacred, just thoughts as they occurred to me partially organized.
>
>I put it in Word so I could get formatting, TOC, etc.
>
>I am attaching a Word and PDF.
>
>I would be happy to put this on a web site for us if you want. I also
>would be happy to
>centralize the edits and re-post it on the web etc.
>
>Thanks
>Gary

From bhalevy@panasas.com Mon Dec 15 23:11:59 2003 Return-Path: X-Sender: bhalevy@panasas.com X-Apparently-To: pnfs-reqs@yahoogroups.com Received: (qmail 45118 invoked from network); 16 Dec 2003 07:11:56 -0000 Received: from unknown (66.218.66.167) by m9.grp.scd.yahoo.com with QMQP; 16 Dec 2003 07:11:56 -0000 Received: from unknown (HELO PIKES.panasas.com) (65.194.124.178) by mta6.grp.scd.yahoo.com with SMTP; 16 Dec 2003 07:11:58 -0000 Received: from yang ([172.17.19.46]) by PIKES.panasas.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id SVSY1CXS; Tue, 16 Dec 2003 02:11:56 -0500 To: Date: Tue, 16 Dec 2003 02:12:23 -0500 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.6604 (9.0.2911.0) Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165 X-eGroups-Remote-IP: 65.194.124.178 From: "Benny Halevy" Subject: FW: (Garth Gibson) NFS Extensions for Parallel Storage, subgroup membership X-Yahoo-Group-Post: member; u=169276676 X-Yahoo-Profile: benny_halevy

-----Original Message-----
From: Garth Gibson [mailto:garth@panasas.com]
Sent: Thursday, December 11, 2003 01:54
To: Andy Adamson; David Black; Don Cameron; Jim Carlson; Peter Corbett; Craig Everhart; Steve Fridella; Garth Gibson; Gary Grider; Benny Halevy; Jon Haswell; Dean Hildebrand; Peter Honeyman; Xiaoye Jiang; Mike Kazar; Tyce McLarty; John Muth; Dave Noveck; Brian Pawlowski; David Pease; Julian Satran; Spencer Shepler; Brent Welch
Cc: Garth Gibson
Subject: NFS Extensions for Parallel Storage, subgroup membership

Folks,

Thanks for a great workshop last Thursday!
Materials presented that day are online: http://www.citi.umich.edu/NEPS/agenda.html

Below are the workshop followup subgroup memberships as they are now. I think I heard Peter say that he would construct auto-managed email lists, which from the additions I've received this week, I have already decided would be great. Please Peter! Names like neps-all, neps-reqs, neps-ops, neps-sbc, neps-osd, neps-nfs would be great.

Our goals, to reprise, are to sketch a set of requirements for NFS Extensions for Parallel Storage, or pNFS extensions, sketch a set of NFS operation extensions (possibly including alternatives), sketch a set of metadata definitions (possibly including alternatives) for out-of-band data access over fixed block (SBC) SCSI protocols, object (OSD) SCSI protocols and file (NFS) ONCRPC protocols.

We want to do this quickly, over the next few months, and to take it into the IETF NFS process as a set of suggestions and strawman protocols. The current plan is that at that point those of us that follow through with this will do it in the IETF NFS working group. In order to convince the IETF and the NFS working group that we have important, useful and viable ideas, we are taking a little time to pull together starting material.
The timelines discussed at the end of the workshop "heir of the dog" session were:
- get workshop notes put together and out in December (Peter and Garth)
- 0th draft of a requirements/problem statement internet draft by mid January
- IETF submission of an internet draft by first week of Feb, so it can be part of the March IETF meeting and used as evidence for inclusion of extensions for parallel storage into the NFS working group charter
- one or more documents (not necessarily fully agreeing) from each subgroup into the IETF NFS email discussion for early to mid March
- a face-to-face followup workshop, open to the IETF NFS group at the FAST 2004 conference, in San Francisco Mar 31 - Apr 2, at which all further plans are proposed, argued and ratified (e.g. shall we be absorbed into the IETF NFS group)

To help move this along, we have asked one person in each subgroup to push, prod and pull ideas and words out of us. Please help these sacrificial volunteers by contributing text, criticizing constructively with alternative text, and finding the time to read materials.

These are volunteers in an unofficial process. We have no rules to be applied by arbitration, no membership to take votes from. If this consensus process, or these people, are not working out, then I suggest grass roots alternatives be suggested and explored as a group. Let's not get bogged down in process this early :-)

But there are always going to be logistical and procedural issues that we need to deal with as a group. The suggestion at the workshop was that these multi-subgroup issues be taken into the requirements group. For example, I suggest that "scope" issues -- what we include and what we exclude from our agenda -- be dealt with in the requirements group, where we would need to add/delete requirements for each distinct aspect of our scope.

I'm sure I'm way over the line giving this much direction :-) so I'll leave it to the subgroups to decide mechanisms for progress.
For example, weekly conference calls, document exchange formats, editorship delegation and/or rotation, agreement achieving processes, .... And with that I'll go off and get to work on suggesting what our problem statement needs to say.

garth
412-805-9878 (cell)

-------------------------------------------------------

pNFS requirements: Garth Gibson
-----------------
Andy Adamson
David Black
Jim Carlson
Peter Corbett
Craig Everhart
Garth Gibson
Gary Grider
Benny Halevy
Jon Haswell
Dean Hildebrand
Peter Honeyman
Tyce McLarty
John Muth
Brian Pawlowski
David Pease
Julian Satran
Spencer Shepler
Brent Welch

NFSv4 ops for pNFS: Peter Honeyman
------------------
Andy Adamson
David Black
Peter Corbett
Craig Everhart
Garth Gibson
Benny Halevy
Jon Haswell
Dean Hildebrand
Peter Honeyman
Xiaoye Jiang
John Muth
Dave Noveck
Brian Pawlowski
Julian Satran
Spencer Shepler
Brent Welch

SBC metadata for pNFS: David Black
---------------------
Andy Adamson
David Black
Jim Carlson
Craig Everhart
Steve Fridella
Garth Gibson
Xiaoye Jiang
Mike Kazar
John Muth
David Pease
Julian Satran
Spencer Shepler

OSD metadata for pNFS: Brent Welch
---------------------
Andy Adamson
Don Cameron
Peter Corbett
Garth Gibson
Benny Halevy
John Muth
Julian Satran
Spencer Shepler
Brent Welch

NFS metadata for pNFS: Peter Corbett
---------------------
Andy Adamson
Peter Corbett
Craig Everhart
Garth Gibson
Jon Haswell
Dean Hildebrand
Peter Honeyman
Xiaoye Jiang
John Muth
Julian Satran
Spencer Shepler

From pnfs-reqs@yahoogroups.com Mon Dec 15 23:51:51 2003 Return-Path: Received: (qmail 39098 invoked from network); 16 Dec 2003 07:51:50 -0000 Received: from unknown (66.218.66.216) by m12.grp.scd.yahoo.com with QMQP; 16 Dec 2003 07:51:50 -0000 Received: from unknown (HELO n6.grp.scd.yahoo.com) (66.218.66.90) by mta1.grp.scd.yahoo.com with SMTP; 16 Dec 2003 07:51:50 -0000 X-eGroups-Return: notify@yahoogroups.com Received: from [66.218.67.252] by n6.grp.scd.yahoo.com with NNFMP; 16 Dec 2003 07:51:44 -0000 Date: 16 Dec 
2003 07:51:43 -0000 Message-ID: <1071561103.2719.47454.w73@yahoogroups.com> X-eGroups-Application: files X-Yahoo-Group-Post: system From: pnfs-reqs@yahoogroups.com To: pnfs-reqs@yahoogroups.com Subject: New file uploaded to pnfs-reqs MIME-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 7bit X-eGroups-Remote-IP: 66.218.66.90 Hello, This email message is a notification to let you know that a file has been uploaded to the Files area of the pnfs-reqs group. File : /draft-ietf-pNFS-problem-statement.doc Uploaded by : benny_halevy Description : Gary Grider's draft 2003-12-13 You can access this file at the URL http://groups.yahoo.com/group/pnfs-reqs/files/draft-ietf-pNFS-problem-statement.doc To learn more about file sharing for your group, please visit http://help.yahoo.com/help/us/groups/files Regards, benny_halevy From garth@panasas.com Wed Dec 17 21:34:01 2003 Return-Path: X-Sender: garth@panasas.com X-Apparently-To: pnfs-reqs@yahoogroups.com Received: (qmail 58554 invoked from network); 18 Dec 2003 05:34:01 -0000 Received: from unknown (66.218.66.218) by m3.grp.scd.yahoo.com with QMQP; 18 Dec 2003 05:34:01 -0000 Received: from unknown (HELO PIKES.panasas.com) (65.194.124.178) by mta3.grp.scd.yahoo.com with SMTP; 18 Dec 2003 05:34:00 -0000 Received: from panasas.com ([172.17.133.207]) by PIKES.panasas.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id SVSY1NBZ; Thu, 18 Dec 2003 00:33:58 -0500 Date: Thu, 18 Dec 2003 00:34:04 -0500 Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v553) Cc: Garth Gibson To: pnfs-reqs@yahoogroups.com Content-Transfer-Encoding: 7bit Message-Id: X-Mailer: Apple Mail (2.553) X-eGroups-Remote-IP: 65.194.124.178 From: Garth Gibson Subject: Re: NEPS-REQS: getting started X-Yahoo-Group-Post: member; u=169457820 X-Yahoo-Profile: garth_a_gibson Tyce, [I've emailed this through the Yahoo group Benny set up, http://groups.yahoo.com/group/pnfs-reqs. 
I will forward it to the folks that have not yet joined this Yahoo group after I get it sent back to me :-)] The RDDP problem statement is similar and dissimilar to what we are doing. It is similar in that it is about higher performance, which always turns out to be cost-performance. It is dissimilar in that it was fighting an uphill battle to get RDMA into the IETF, while we are looking at no preconceived support or opposition in the IETF (that I am aware of). And it is dissimilar in that what we are proposing helps in the manageability of federated systems, which is not really a performance issue. I followed the RDDP example closely because it was easy -- our arguments on strictly bandwidth are at least as strong, in my opinion. And because I am not certain how to predict the IETF management's reaction to a manageability argument. And the standardized client code argument, although very important to some of us, seemed outside my notion of the IETF scope. Perhaps those with more experience selling ideas to the IETF could educate us? Should we focus on a small number of the most easily demonstrated problems or fill the problem statement out with all the problems we can contribute to solving? garth On Monday, December 15, 2003, at 01:49 PM, Tyce McLarty wrote: > I've been wondering how important it is too cast the "problem" as one > of cost, rather than as the ability to do things that cannot be done > today with added benefits in cost reduction. > > I liked the list that Garth put up at the workshop: > > Scalable bandwidth > Scalable capacity > Load balancing > capacity balancing > > plus the big winner - a standardized client. > > So the Introduction would be basically two paragraphs with (in either > order): > 1. proposal to extend NFSv4 to allow parallel out-of-band client > access to data separate from metadata operations. > 2. why it's important to do using the reasons outlined above. 
> > My question is - How close do we need to model the RDMA problem > statement? Is cost the best/only justification or can we use new & > needed capability plus value added? > > I think Gary has slanted his additions this direction, but seems like > we should all agree on some basic principles before we get too deep in > word-smithing. > > Thanks, > Tyce > > At 10:02 PM 12/12/2003 -0700, Gary Grider wrote: > >> I decided to toss out a very quick and dirty draft with a lot of >> parts missing. >> Nothing sacred, just thoughts as they occurred to me partially >> organized. >> >> I put it in Word so I could get formatting, TOC, etc. >> >> I am attaching a Word and PDF. >> >> I would be happy to put this on a web site for us if you want. I >> also would be happy to >> centralize the edits and re-post it on the web etc. >> >> Thanks >> Gary From garth@panasas.com Wed Dec 17 21:42:23 2003 Return-Path: X-Sender: garth@panasas.com X-Apparently-To: pnfs-reqs@yahoogroups.com Received: (qmail 93406 invoked from network); 18 Dec 2003 05:42:23 -0000 Received: from unknown (66.218.66.167) by m11.grp.scd.yahoo.com with QMQP; 18 Dec 2003 05:42:23 -0000 Received: from unknown (HELO PIKES.panasas.com) (65.194.124.178) by mta6.grp.scd.yahoo.com with SMTP; 18 Dec 2003 05:42:22 -0000 Received: from panasas.com ([172.17.133.207]) by PIKES.panasas.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id SVSY1NCX; Thu, 18 Dec 2003 00:42:19 -0500 Date: Thu, 18 Dec 2003 00:42:22 -0500 Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v553) Cc: Garth Gibson To: pnfs-reqs@yahoogroups.com, pnfs-ops@yahoogroups.com Content-Transfer-Encoding: 7bit Message-Id: X-Mailer: Apple Mail (2.553) X-eGroups-Remote-IP: 65.194.124.178 From: Garth Gibson Subject: pNFS Discussion Summary 1: 12/18/03 X-Yahoo-Group-Post: member; u=169457820 X-Yahoo-Profile: garth_a_gibson Summary 1: 12/18/03 pNFS-ops and pNFS-reqs folks, Following on the 
conversation that has been going on in the pNFS-ops list since Brent put out his notes on the heir-of-the-dog meeting of Fri Dec 5, I have tried below to summarize what I see as broad issues. Your additions, corrections or directions are requested. One theme I see evolving quickly is the differing opinions of the driving requirements and how these drive differing opinions of implementation issues in the NFSv4 operations discussion. I have tried to identify which issues are more about requirements than about "how" to achieve a requirement in NFSv4. This is not intended to be a power play, by taking the topic out of the reach of anyone. It is more to clarify which topics we need to resolve by defining our scope and share with the folks that are only on the requirements email list. I imagine the resolution to these requirements-related issues will be more customer oriented and feature set driven.

Topics:
0.0 Defining Requirements
1.0 Minimalism
1.1 Proxying
1.2 Cache consistency
1.3 Delegation promotion & reacquisition
1.4 Layout delegations
1.5 Concurrent write
1.6 Map revocation
1.7 Separability
1.8 NTFS application semantics

----------------------------------------
[0.0 Defining Requirements]: What is the requirements subgroup doing, and how is it related to the ops subgroup discussions?

I am beginning to see a significant difference between a "problem statement" document and a "requirements" document. I believe that in a problem statement we can make a strong case for a set of properties and applications that are currently underserved in NFSv4, and a direction that could in one or more steps resolve some or all of the problem. Alternatively I am coming to see the detailed requirements as a compendium of the most contentious and impactful issues, how they were argued and what resolution was accepted. 
I can see the problem statement getting done before we have sorted out all the hard problems, or even run into all of them, so it is a good document for establishing our interests in the IETF. But I suspect that the requirements document stays open well into agreement on the specification issues. For comparison, the first NFSv4 document was called "Design Considerations" (rfc2624): This document is to cover the "limitations and deficiencies of NFS version 3". This document will also be used as a mechanism to focus discussion and avenues of investigation as the definition of NFS version 4 progresses. Therefore, the contents of this document cover the general functional/feature areas that are anticipated for NFS version 4. I propose that what we have started into in the requirements subgroup is the problem statement, and that we should be careful to not let it get bogged down in the longer term requirements resolutions.

----------------------------------------
[1.0 Minimalism]: How much additional functionality do we sacrifice to limit the changes we seek in NFSv4?

On one hand, some have said that getting to one true file system, with the high performance and the manageability of federated systems that might come with out-of-band access, is worth not matching *every* feature of all existing out-of-band file systems with this first set of extensions to NFSv4. That we should bite off what we can do quickly, correctly, with a clear incremental value to NFSv4, and roadmap more aggressive changes that could bog us down, or introduce so much complexity that interoperability becomes elusive. And that we should be mindful of the reception we may get from the IETF NFS working group if we *appear* to use out-of-band as an excuse to ask for a brace of changes in other aspects of NFSv4.

On the other hand, the other out-of-band file systems that are inspiring the evolution of NFSv4 have customers that may not accept any backward steps in an evolution to NFSv4. 
This could create the need to develop, carry and differentiate all the diverse one-off out-of-band file systems plus a new out-of-band NFSv4. Some think it makes more sense to go far enough with this first NFSv4 to simplify the marketplace by making it reasonable for various vendors to deprecate/end-of-life/begin to wean from their proprietary offering. While it is certainly conceivable that we could be designing a roadmap of solutions in detail from the start, communication among standards bodies is hard enough without the challenge of designing specs for both with and without a requirement. This is a central issue in defining the requirements for out-of-band NFSv4, or at least for defining the scope of the first set of extensions.

----------------------------------------
[1.1 Proxying]: Operations/work that can only be done out-of-band vs alternative access through the NFSv4 server for all operations/work

On one hand, some suggest that a set of out-of-band clients should not have to also have a data path through the NFSv4 metadata server. One reason is that customers may not tolerate the large variability in performance between out-of-band (when the going is good) and in-band (when the server chooses not to grant or to take away a delegation) accesses. Another reason, and I paraphrase someone else here, is that it is possible to construct out-of-band metadata servers that do not have access to the data servers except through the clients -- I encourage the source of this scenario to replace my paraphrasing with a correct use case, because I find it odd to design for file servers that do not have access to the data servers.

On the other hand, others have suggested that any access or work that a client can do out-of-band should be possible with one or more commands applied to the metadata server's data path. 
This has been proposed for coping with recalled delegations, including concurrent writing by multiple clients; retry after client access errors, provided adequate idempotency of out-of-band operations; and many alternative implementations of out-of-band clients, including legacy clients that use out-of-band never or rarely. I think this is a topic that should be argued one way or the other in the requirements document. Use cases and examples in other systems would be best.

----------------------------------------
[1.2 Cache consistency]: NFSv4 delegations are not about client cache consistency; does out-of-band access require stronger cache consistency than NFSv4 provides?

NFSv4 cache consistency is a client function, based on testing file attributes on open and close. While a client holds a delegation, its users can close and reopen a file without recourse to the server, so inside a delegation a client's cache contents for that file must be valid and up to date. However, a client cannot mandate getting a delegation on open, it must immediately (approximately) give up a delegation if it is recalled and a client has no way to reacquire a delegation on an open file after that delegation has been recalled. So we must not confuse delegations with strong cache consistency. Many of the various proprietary out-of-band file systems have much stronger client cache consistency, involving more different types and interactions of cache callbacks. Some of these differences may have been motivated by desire for differentiation, some by apps underserved by NFS cache consistency semantics, and some by the long standing designer belief that stronger semantics are theoretically better. The question we must resolve, and argue in the requirements document, is whether out-of-band access only within the NFSv4 cache consistency and delegations is sufficient and, if not, why not and how much more must/should be added before such a product is valuable. 
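To make the "testing file attributes on open and close" point concrete, here is a minimal sketch of that style of client-side consistency. It is an illustration only: the `Server` and `Client` classes and the integer change counter are hypothetical stand-ins for the real attributes and operations, and no delegations are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy file server (hypothetical): name -> (change counter, data)."""
    files: dict = field(default_factory=dict)

    def getattr(self, name):
        # Return only the change counter, standing in for file attributes.
        return self.files.get(name, (0, b""))[0]

    def read(self, name):
        return self.files.get(name, (0, b""))

    def write(self, name, data):
        change = self.getattr(name) + 1
        self.files[name] = (change, data)
        return change


class Client:
    """Client that validates its cache on open and flushes on close."""

    def __init__(self, server):
        self.server = server
        self.cache = {}  # name -> (change counter, data)
        self.dirty = {}  # name -> data not yet written back

    def open(self, name):
        # Compare the cached attribute with the server's; refetch if stale.
        change = self.server.getattr(name)
        cached = self.cache.get(name)
        if cached is None or cached[0] != change:
            self.cache[name] = self.server.read(name)

    def read(self, name):
        if name in self.dirty:
            return self.dirty[name]
        return self.cache[name][1]

    def write(self, name, data):
        self.dirty[name] = data  # buffered locally until close

    def close(self, name):
        if name in self.dirty:
            data = self.dirty.pop(name)
            change = self.server.write(name, data)
            self.cache[name] = (change, data)


server = Server()
a, b = Client(server), Client(server)
a.open("f"); a.write("f", b"v1")
b.open("f")
before_close = b.read("f")   # a has not closed yet
a.close("f")
stale = b.read("f")          # b still trusts its cache until it reopens
b.open("f")
fresh = b.read("f")
```

In this toy model a second client sees the first client's data only after the writer closes and the reader reopens, which is the gap that the stronger callback-based schemes mentioned above are meant to narrow.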
I think that application use cases should be discussed. And I caution us that most of us are the converted, coming to NFSv4 from one of these proprietary file systems, so gaining agreement amongst ourselves easily is not a good predictor of the challenge of gaining the agreement of the NFS standards working group.

----------------------------------------
[1.3 Delegation promotion & reacquisition]: must/should NFSv4 offer mechanisms for clients to possess a delegation more than once per open

Delegations in NFSv4 are new, and came with significant concern about lots of complexity for not much performance, as they may do as little as avoid the client waiting for one round trip to the server on open. So, as described above with respect to cache consistency, the limitations on delegations can mean great difficulties for clients having performance requirements calling for out-of-band access mostly, or exclusively. So we have begun to propose mechanisms for clients to be more aggressive about seeking, obtaining, reobtaining after a recall, and even waiting for a signal that a denied delegation is now available. This could lead to discussions of transitioning from a write delegation to a read delegation, rather than no delegation, when a second delegation is requested. We all know, or can imagine, plenty of mechanism for this type of logic -- after all, it is not far from what some systems do for cache consistency. But all of this comes with complexity, that threat to interoperability, and chips away at minimalism. I would suggest that we capture use cases to drive requirements for controversial steps down this path. 
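As a rough illustration of how much mechanism the ideas in 1.3 imply, the sketch below models a server that lets clients ask for a delegation, downgrades a writer to a read delegation when a second reader appears, and remembers denied requesters so they can be signalled later. Everything here (`DelegationServer`, `request`, `release`, the waiter queue) is a hypothetical construction for discussion, not existing NFSv4 behavior.

```python
from enum import Enum


class Deleg(Enum):
    NONE = "none"
    READ = "read"
    WRITE = "write"


class DelegationServer:
    """Hypothetical per-file delegation bookkeeping on the server."""

    def __init__(self):
        self.holders = {}  # client -> Deleg currently held
        self.waiters = []  # denied clients waiting for a signal

    def request(self, client, want):
        others = {c: d for c, d in self.holders.items() if c != client}
        if not others:
            # No conflict: grant exactly what was asked for.
            self.holders[client] = want
            return want
        if want is Deleg.READ:
            # Downgrade any writer to READ rather than recalling to NONE.
            for c, d in others.items():
                if d is Deleg.WRITE:
                    self.holders[c] = Deleg.READ
            self.holders[client] = Deleg.READ
            return Deleg.READ
        # A WRITE request conflicts with other holders: deny and queue.
        if client not in self.waiters:
            self.waiters.append(client)
        return self.holders.get(client, Deleg.NONE)

    def release(self, client):
        """Drop a delegation; return a waiting client to signal, if any."""
        self.holders.pop(client, None)
        if not self.holders and self.waiters:
            return self.waiters.pop(0)
        return None


srv = DelegationServer()
first = srv.request("A", Deleg.WRITE)    # A gets a write delegation
second = srv.request("B", Deleg.READ)    # A is downgraded, B shares READ
denied = srv.request("C", Deleg.WRITE)   # C is denied and queued
signal_1 = srv.release("A")              # B still holds, nobody signalled
signal_2 = srv.release("B")              # file is free, C gets the signal
retry = srv.request("C", Deleg.WRITE)    # C reacquires on retry
```

Even this toy version needs per-file holder state, a waiter queue and a callback path, which is the kind of added complexity the minimalism argument weighs against.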
----------------------------------------
[1.4 Layout delegations]: can/should layout metadata "ride" on NFSv4 delegations or are new "layout" delegations needed

If the delegations currently provided by NFSv4 are insufficient, for reasons of cache consistency or the need to be able to reacquire a delegation in order to ensure that performance degradations can be limited, then some are suggesting that rather than proposing to change the semantics of the current delegations, we add new delegations tailored to the purpose, so called layout delegations. This is consistent with the advice we heard Dec 4 that it is much easier, and more welcomed, to add new things to NFSv4 than to change what is already there. Assuming that in response to requirements arguments, we find the existing NFSv4 delegations insufficient, then I think this topic is an implementation issue for the NFSv4 operations subgroup. But I for one would like to err on the side of fewer NFSv4 changes and slightly weaker semantics, where possible.

----------------------------------------
[1.5 Concurrent write]: write delegations now are held by exactly one client, if any; should/must NFS support multiple clients holding concurrent layout delegations

One specifically excluded use case for out-of-band access is concurrent write, actually concurrent read and write, or write and write, by different clients. This is normally associated with expensive client cache consistency algorithms, but for our purposes here, the issue is managing the ordering, grouping/atomicity, and failure recovery of changes on data servers, not updating/invalidating the contents of client caches. It is certainly feasible to address out-of-band concurrent writing to data servers without addressing client cache consistency, if we so choose. I believe three folks with experience with different existing file systems referred to databases as the use case for needing concurrent write. 
I believe out-of-band concurrent write is an important use case to call out carefully, because an ambitious implementation of it could lead to a lot of state-maintaining messaging. Some have said that allowing multiple clients to hold the same lock is a current need in NFSv4, and that a solution to this can provide the infrastructure for concurrent delegation of layout maps for read and overwrite (when growing the size of the file is not needed). This seems like a good operations discussion topic.

----------------------------------------
[1.6 Map revocation]: can/must the NFS server be able to revoke a client's use of a map, and enforce no future use (fence off the map)

NFSv4 delegations allow a broken or malicious client no additional power to damage the stored file system because state changes must go through the server. But a delegated layout map that is held and used by a broken or malicious client after the delegation has been recalled could damage the stored file system in a way that the server, by not being on the data path, has no obvious way to protect against. So there has been a call for the ability for the server to fence out a client or enforce the revocation of a client's access to a specific file or filesystem. At first glance all three data server technologies, blocks, objects and files have some solution (blocks: lun masking/acls or SAN zoning; objects: capability revocation, key replacement; files: component file acls, volatile file handles). The scope and cost of each of these mechanisms may be dramatically different. Some would say that this is going to end up being a differentiating property of the choice of underlying data server. For example, many would say that in systems that allow out-of-band block access, the client machines must be trustworthy to respect the delegation recall message (and lease timeouts). Others would object to this weakening of the NFS server integrity. I also see this as a requirements argument. 
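For readers less familiar with the object-style option listed above (capability revocation), the following sketch shows one way a metadata server could fence a client without sitting on the data path. It is only a sketch under stated assumptions: the shared `SECRET`, the per-object version number, and the `MetadataServer`/`DataServer` classes are all hypothetical and are not taken from any OSD or NFSv4 specification.

```python
import hashlib
import hmac

# Assumption: a key shared by the metadata server and the data server only.
SECRET = b"shared-by-metadata-and-data-server"


def sign(client, obj, version):
    """Capability tag binding a client to one version of one object."""
    msg = f"{client}:{obj}:{version}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


class DataServer:
    def __init__(self):
        self.version = {}  # obj -> capability version currently accepted
        self.data = {}     # obj -> payload

    def set_version(self, obj, version):
        self.version[obj] = version

    def write(self, client, obj, version, tag, payload):
        if version != self.version.get(obj):
            return False  # stale capability: the client has been fenced
        if not hmac.compare_digest(tag, sign(client, obj, version)):
            return False  # tag was not issued for this client/object
        self.data[obj] = payload
        return True


class MetadataServer:
    def __init__(self, data_server):
        self.ds = data_server
        self.version = {}

    def grant(self, client, obj):
        v = self.version.setdefault(obj, 1)
        self.ds.set_version(obj, v)
        return v, sign(client, obj, v)

    def revoke(self, obj):
        # Bumping the version invalidates every outstanding capability
        # for this object, so this fence is per object, not per client.
        self.version[obj] = self.version.get(obj, 1) + 1
        self.ds.set_version(obj, self.version[obj])


ds = DataServer()
mds = MetadataServer(ds)
v, tag = mds.grant("client1", "obj")
ok_before = ds.write("client1", "obj", v, tag, b"x")
mds.revoke("obj")
ok_after = ds.write("client1", "obj", v, tag, b"y")      # rejected
v2, tag2 = mds.grant("client1", "obj")
ok_forged = ds.write("client2", "obj", v2, tag2, b"z")   # wrong client
ok_regrant = ds.write("client1", "obj", v2, tag2, b"w")
```

The point of the sketch is that enforcement happens at the data server, so a client that ignores a recall is still stopped. A block device with no per-request check would instead have to rely on LUN masking or zoning, or on trusting the client.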
----------------------------------------
[1.7 Separability]: Independence vs co-dependence of layout metadata access and NFSv4

On one hand, simple "an address per block/object/file" maps could be represented as an array of NFSv4 attributes, manipulated using existing NFSv4 attribute accessing commands, so as to reduce the amount of change to NFSv4. On the other hand, particularly for block maps of large files composed of extents, simple array indexing may be cumbersome and much bulkier than necessary. And also on the other hand, some suggest that it is desirable for the metadata access protocol to be separate from NFSv4 attribute access, so that the same metadata access protocol might be reusable under other file services. I think this topic would benefit from proposed metadata formats, particularly the SBC (block) maps.

----------------------------------------
[1.8 NTFS application semantics]: applications coded to NTFS semantics are different from those coded to POSIX and UNIX semantics

NFS originated as an exported file system, whose semantics were defined by the underlying local filesystem on the file server. But since that local filesystem has almost always been UNIX or UNIX like, customers have come to think of NFS semantics as a well defined thing, not far from UNIX semantics (but with a customary list of POSIX exceptions). The semantics NTFS presents to applications using its storage are different in significant ways. Some of us see an evolution to better support for clients trying to support NTFS well to be very desirable. Others see chasing this as more than the NFS group as a whole is likely to bite off. This, and any other issues about wire protocol support for important semantics needed by different application file system interfaces (middleware exploited API extensions in databases or parallel programming systems such as MPI-IO) are also requirements topics.

End summary 1. 
From bhalevy@panasas.com Wed Dec 17 22:13:53 2003 Return-Path: X-Sender: bhalevy@panasas.com X-Apparently-To: pnfs-reqs@yahoogroups.com Received: (qmail 96391 invoked from network); 18 Dec 2003 06:13:52 -0000 Received: from unknown (66.218.66.217) by m5.grp.scd.yahoo.com with QMQP; 18 Dec 2003 06:13:52 -0000 Received: from unknown (HELO PIKES.panasas.com) (65.194.124.178) by mta2.grp.scd.yahoo.com with SMTP; 18 Dec 2003 06:13:52 -0000 Received: from yang ([172.17.19.55]) by PIKES.panasas.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id SVSY1NF3; Thu, 18 Dec 2003 01:13:50 -0500 To: Cc: Date: Thu, 18 Dec 2003 01:13:41 -0500 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.6604 (9.0.2911.0) X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165 Importance: Normal In-reply-to: X-eGroups-Remote-IP: 65.194.124.178 From: "Benny Halevy" Subject: RE: [pnfs-reqs] Re: NEPS-REQS: getting started X-Yahoo-Group-Post: member; u=169276676 X-Yahoo-Profile: benny_halevy Garth, In case you guys want to broaden the problem statement... There are a couple of arguments I believe may be appealing to the IETF: 1. Interoperability. Several of the existing non monolithic file systems mentioned use proprietary protocols carried over Internet protocols. Standardizing their access protocols within NFS will allow interoperability between heterogeneous client hosts and heterogeneous server systems. The standardized client argument may fall into the interoperability category from the IETF point of view. 2. Taking advantage of IP SANs With the introduction of iSCSI, block and object based storage systems become accessible over IP based networks. 
NEPS takes advantage of this paradigm by allowing clients direct (yet moderated and secure) access to networked storage and therefore it enhances the value proposition of IP SANs. Benny > -----Original Message----- > From: Garth Gibson [mailto:garth@Panasas.Com] > Sent: Thursday, December 18, 2003 00:34 > To: pnfs-reqs@yahoogroups.com > Cc: Garth Gibson > Subject: [pnfs-reqs] Re: NEPS-REQS: getting started > > > Tyce, > > [I've emailed this through the Yahoo group Benny set up, > http://groups.yahoo.com/group/pnfs-reqs. I will forward it to the > folks that have not yet joined this Yahoo group after I get it sent > back to me :-)] > > The RDDP problem statement is similar and dissimilar to what we are > doing. It is similar in that it is about higher performance, which > always turns out to be cost-performance. It is dissimilar in that it > was fighting an uphill battle to get RDMA into the IETF, while we are > looking at no preconceived support or opposition in the IETF (that I am > aware of). And it is dissimilar in that what we are proposing helps in > the manageability of federated systems, which is not really a > performance issue. > > I followed the RDDP example closely because it was easy -- our > arguments on strictly bandwidth are at least as strong, in my opinion. > And because I am not certain how to predict the IETF management's > reaction to a manageability argument. And the standardized client code > argument, although very import to some of us, seemed outside my notion > of the IETF scope. > > Perhaps those with more experience selling ideas to the IETF could > educate us? Should we focus on a small number of the most easily > demonstrated problems or fill the problem statement out with all the > problems we can contribute to solving? 
> > garth > > > On Monday, December 15, 2003, at 01:49 PM, Tyce McLarty wrote: > > I've been wondering how important it is too cast the "problem" as one > > of cost, rather than as the ability to do things that cannot be done > > today with added benefits in cost reduction. > > > > I liked the list that Garth put up at the workshop: > > > > Scalable bandwidth > > Scalable capacity > > Load balancing > > capacity balancing > > > > plus the big winner - a standardized client. > > > > So the Introduction would be basically two paragraphs with (in either > > order): > > 1. proposal to extend NFSv4 to allow parallel out-of-band client > > access to data separate from metadata operations. > > 2. why it's important to do using the reasons outlined above. > > > > My question is - How close do we need to model the RDMA problem > > statement? Is cost the best/only justification or can we use new & > > needed capability plus value added? > > > > I think Gary has slanted his additions this direction, but seems like > > we should all agree on some basic principles before we get too deep in > > word-smithing. > > > > Thanks, > > Tyce > > > > At 10:02 PM 12/12/2003 -0700, Gary Grider wrote: > > > >> I decided to toss out a very quick and dirty draft with a lot of > >> parts missing. > >> Nothing sacred, just thoughts as they occurred to me partially > >> organized. > >> > >> I put it in Word so I could get formatting, TOC, etc. > >> > >> I am attaching a Word and PDF. > >> > >> I would be happy to put this on a web site for us if you want. I > >> also would be happy to > >> centralize the edits and re-post it on the web etc. > >> > >> Thanks > >> Gary > > > To unsubscribe from this group, send an email to: > pnfs-reqs-unsubscribe@yahoogroups.com > > > > Yahoo! Groups Links > > To visit your group on the web, go to: > http://groups.yahoo.com/group/pnfs-reqs/ > > To unsubscribe from this group, send an email to: > pnfs-reqs-unsubscribe@yahoogroups.com > > Your use of Yahoo! 
Groups is subject to: > http://docs.yahoo.com/info/terms/ > > From garth@panasas.com Thu Dec 18 14:37:55 2003 Return-Path: X-Sender: garth@panasas.com X-Apparently-To: pnfs-reqs@yahoogroups.com Received: (qmail 89807 invoked from network); 18 Dec 2003 22:37:54 -0000 Received: from unknown (66.218.66.216) by m13.grp.scd.yahoo.com with QMQP; 18 Dec 2003 22:37:54 -0000 Received: from unknown (HELO PIKES.panasas.com) (65.194.124.178) by mta1.grp.scd.yahoo.com with SMTP; 18 Dec 2003 22:37:54 -0000 Received: from panasas.com ([172.17.133.207]) by PIKES.panasas.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id SVSY1RMV; Thu, 18 Dec 2003 17:37:52 -0500 Date: Thu, 18 Dec 2003 17:37:50 -0500 Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v553) To: pNFS Operations , pNFS Requirements Content-Transfer-Encoding: 7bit In-Reply-To: Message-Id: X-Mailer: Apple Mail (2.553) X-eGroups-Remote-IP: 65.194.124.178 From: Garth Gibson Subject: Re: [pnfs-ops] pNFS Discussion Summary 1: 12/18/03: subtopic: proxying X-Yahoo-Group-Post: member; u=169457820 X-Yahoo-Profile: garth_a_gibson

Thanks Dave. I agree. Let's refine the proxying issues: Legacy, strict, functional and recovery proxying.

[1.1.0 Legacy proxying]: an NFS-v4.x server must be able to execute the full NFS-v4.0 or NFS-v4.1 protocol.

I think Dave has given the case for this strongly. I do not see any case against this.

-------------------------------------------
[1.1.1 Strict proxying]: does an NFS-v4.x server have to be able to execute exactly the wire packet that an NFS-v4.x client might have sent to an SBC/OSD/NFS data server?

This captures the notion that a metadata server must also be a store-and-forward proxy for every data server it manages. It requires NFS-v4.x servers implement SCSI SBC over FC, if their data servers implement it; and the same for objects and files. This only makes sense to me for NFS data servers. 
And it is not what I intended in my prior summary, although it is a relevant question. I would say that pNFS requirements should not require Strict Proxying.

-------------------------------------------
[1.1.2 Functional proxying]: a file transformation achievable by an NFS-v4.x client using a set of data server operations must be equivalently achievable using a (probably different) set of NFS-v4.x server operations

This is the topic I intended to address in the last email. I believe Dave is arguing that even with metadata servers that do not have access to their data servers, the vendor of such a metadata server can construct a proprietary protocol for the metadata server to (strict) proxy data server accesses through clients that do have data server access. I am not comfortable making up a counter to this, so I exhort those that want a metadata server without data server access to speak up if they disagree.

> On one hand, some suggest that a set of out-of-band clients should not > have to also have a data path through the NFSv4 metadata server. One > reason is that customers may not tolerate the large variability in > performance between out-of-band (when the going is good) and in-band > (when the server chooses not to grant or to take away a delegation) > accesses. Another reason, and I paraphrase someone else here, is that > it is possible to construct out-of-band metadata servers that do not > have access to the data servers except through the clients -- I > encourage the source of this scenario to replace my paraphrasing with > a correct use case, because I find it odd to design for file servers > that do not have access to the data servers. > > On the other hand, others have suggested that any access or work that > a client can do out-of-band should be possible with one or more > commands applied to the metadata server's data path. 
This has been > proposed for coping with recalled delegations, including concurrent > writing by multiple clients; retry after client access errors, > provided adequate idempotency of out-of-band operations; and many > alternative implementations of out-of-band clients, including legacy > clients that use out-of-band never or rarely. > > I think this is a topic that should be argued one way or the other in > the requirements document. Use cases and examples in other systems > would be best. ------------------------------------------- [1.1.3 Recovery proxying]: a file transformation begun by an NFS-v4.x client using a set of data server operations, but interrupted before completion, must be equivalently completable using a (probably different) set of NFS-v4.x server operations Some have suggested that having this property will greatly simplify the amount of spec that is devoted to out-of-band error recovery. Others have commented that a simple way to achieve this would be to require that all operations on data servers should be idempotent. ------------------------------------------- garth On Thursday, December 18, 2003, at 12:21 PM, Noveck, Dave wrote: > Good summary. > > I want to address the "proxying" issue. > >> [1.1 Proxying]: Operations/work that can only be done out-of-band vs >> alternative access through the NFSv4 server for all operations/work > > If you are talking about operations in the extension (let's call it > NFS-v4.x), that are not in the previous minor version (let's assume > that is nfs-v4.1), then you have a choice of whether these are > supported > for access through the server, or only for access by the client with > the > data server. Let's call this the issue of proxying in the strict > sense. > > There is another issue that people are calling "proxying" but is really > logically distinct. That is the issue of access by the previous minor > version, e.g. nfs-v4.0 or nfs-v4.1. 
Those versions have no concept of > separate data servers and they need to be able to work. End of story. > If you can't read files stored in nfs-v4.x with nfs-v4.0, you do not > have a minor version without proxying. You don't have a minor version > at all. I believe the working group is never going to accept that. > Even if I'm wrong and you can get the working group to accept that, > it is going to be very contentious and thus take up a lot of time. > Anybody, who really wants to go down this path should seriously > consider > the trade-off between supporting something they find objectionable and > getting a standard a lot later, if at all. > >> On one hand, some suggest that a set of out-of-band clients should not >> have to also have a data path through the NFSv4 metadata server. One >> reason is that customers may not tolerate the large variability in >> performance between out-of-band (when the going is good) and in-band >> (when the server chooses not to grant or to take away a delegation) >> accesses. > > Then such customers will use clients that access things out-of-band > whenever possible, and servers that never refuse to give out layout > delegations. You have a number of quality-of-implementations issues > for v4.x clients and servers. If a particular client only supports > access via v4.0, then performance will suck, and the working group > will understand that, but it won't accept not being able to use > v4.0 at all. The customer is going to be motivated to upgrade his > clients for those that need high-performance access, but he may be > OK with some clients using v4.0 for a long time, depending on the > particular performance those clients need. (And some will want v2/v3 > access but that is a matter that the working group has no say about). 
> >> Another reason, and I paraphrase someone else here, is that >> it is possible to construct out-of-band metadata servers that do not >> have access to the data servers except through the clients -- I >> encourage the source of this scenario to replace my paraphrasing with >> a >> correct use case, because I find it odd to design for file servers >> that >> do not have access to the data servers. > > So let's grant that it is possible (and we'll pass over the issue of > whether it is desirable, and in fact so desirable that one is willing > to > not get a standard and or get it much later). > > So we have a metadata server and it, for whatever reason, does not have > access to the data servers. However, by hypothesis, there are machines > (e.g. clients), that can communicate with both. So, if one has such an > architecture, then one can take such a machine, give it a > communication path > to the meta-data server and the data server and have the meta-data > server > transfer v4.0 READ requests to it, let it read the data from the data > server and send it back to the meta-data server who send it back to the > original requestor. Is that a very good solution? No. Is it likely > to be performant? No. Will it satisfy any particular customer? I > don't > know and that is the implementer's business decision. Will it satisfy > the hypothetical customer who doesn't care about v4.0 access? Clearly. > Will it satisfy the v4 working group? Yes, because they are not in the > business of telling you how performant v4.0 access has got to be. > >> On the other hand, others have suggested that any access or work that >> a >> client can do out-of-band should be possible with one or more commands >> applied to the metadata server's data path. 
This has been proposed >> for >> coping with recalled delegations, including concurrent writing by >> multiple clients; retry after client access errors, provided adequate >> idempotency of out-of-band operations; and many alternative >> implementations of out-of-band clients, including legacy clients that >> use out-of-band never or rarely. > > This effort is going to take a while, but if we manage it correctly, it > is not going to take so long that v3 clients are going to be rare > things, > and they have to be supported. But v3 clients are not an issue for the > working group. V4.0 clients are and they will be around and you will > have to support them, and I believe the working group is not going to > be disposed to cut you a lot of slack on this issue (and I don't see > why it should). > >> I think this is a topic that should be argued one way or the other in >> the requirements document. Use cases and examples in other systems >> would be best. > > I think the requirement should be that this work should be done as a > set of extensions to nfs-v4 delivered as a v4 minor version. If there > is some feature/requirement that conflicts with that model (and it is a > pretty flexible one), then you have to think long and hard before > deciding > that that requirement is more important than this basic delivery > vehicle, > because it seems to me that it is, in almost all respects, the ideal > way > to make this sort of technology available for widespread use.
From julian_satran@il.ibm.com Mon Dec 22 02:26:02 2003 Return-Path: X-Sender: Julian_Satran@il.ibm.com X-Apparently-To: pnfs-reqs@yahoogroups.com Received: (qmail 86299 invoked from network); 22 Dec 2003 10:26:01 -0000 Received: from unknown (66.218.66.218) by m11.grp.scd.yahoo.com with QMQP; 22 Dec 2003 10:26:01 -0000 Received: from unknown (HELO mtagate3.de.ibm.com) (195.212.29.152) by mta3.grp.scd.yahoo.com with SMTP; 22 Dec 2003 10:26:00 -0000 Received: from d12relay01.megacenter.de.ibm.com (d12relay01.megacenter.de.ibm.com [9.149.165.180] (may be forged)) by mtagate3.de.ibm.com (8.12.10/8.12.10) with ESMTP id hBMAPxn0031456; Mon, 22 Dec 2003 10:25:59 GMT Received: from d10ml001.telaviv.ibm.com (d12av02.megacenter.de.ibm.com [9.149.165.228]) by d12relay01.megacenter.de.ibm.com (8.12.10/NCO/VER6.6) with ESMTP id hBMAPwG4256428; Mon, 22 Dec 2003 11:25:58 +0100 In-Reply-To: To: pnfs-ops@yahoogroups.com Cc: pnfs-ops@yahoogroups.com, pnfs-reqs@yahoogroups.com MIME-Version: 1.0 X-Mailer: Lotus Notes Release 6.5 September 26, 2003 Message-ID: Date: Mon, 22 Dec 2003 12:25:57 +0200 X-MIMETrack: Serialize by Router on D10ML001/10/M/IBM(Release 6.0.2CF2|July 23, 2003) at 22/12/2003 12:25:58, Serialize complete at 22/12/2003 12:25:58 Content-Type: multipart/alternative; boundary="=_alternative 00394B9DC2256E04_=" X-eGroups-Remote-IP: 195.212.29.152 From: Julian Satran Subject: RE: [pnfs-ops] pNFS Discussion Summary 1: 12/18/03 X-Yahoo-Group-Post: member; u=64714603
Since I raised the issue of the metadata server not having access to all its data servers (or at least not with adequate bandwidth), I feel compelled to say that Dave's arguments about supporting 4.0 are compelling enough to make it mandatory. The open issue is whether it is legal for a "compliant server" to have serving data disabled by a local administrative function (the old "must implement but may use").
Otherwise an organization that wants to discourage use of data serving through the metadata server has very little it can do to enforce policy in a way that will not affect other clients (it may serve poorly, but this still affects other clients).
Julo
"Noveck, Dave" 18/12/2003 19:21 Please respond to pnfs-ops@yahoogroups.com To , cc Subject RE: [pnfs-ops] pNFS Discussion Summary 1: 12/18/03 Good summary. I want to address the "proxying" issue. > [1.1 Proxying]: Operations/work that can only be done out-of-band vs > alternative access through the NFSv4 server for all operations/work If you are talking about operations in the extension (let's call it NFS-v4.x), that are not in the previous minor version (let's assume that is nfs-v4.1), then you have a choice of whether these are supported for access through the server, or only for access by the client with the data server. Let's call this the issue of proxying in the strict sense. There is another issue that people are calling "proxying" but is really logically distinct. That is the issue of access by the previous minor version, e.g. nfs-v4.0 or nfs-v4.1. Those versions have no concept of separate data servers and they need to be able to work. End of story. If you can't read files stored in nfs-v4.x with nfs-v4.0, you do not have a minor version without proxying. You don't have a minor version at all. I believe the working group is never going to accept that. Even if I'm wrong and you can get the working group to accept that, it is going to be very contentious and thus take up a lot of time. Anybody, who really wants to go down this path should seriously consider the trade-off between supporting something they find objectionable and getting a standard a lot later, if at all. > On one hand, some suggest that a set of out-of-band clients should not > have to also have a data path through the NFSv4 metadata server.
One > reason is that customers may not tolerate the large variability in > performance between out-of-band (when the going is good) and in-band > (when the server chooses not to grant or to take away a delegation) > accesses. Then such customers will use clients that access things out-of-band whenever possible, and servers that never refuse to give out layout delegations. You have a number of quality-of-implementations issues for v4.x clients and servers. If a particular client only supports access via v4.0, then performance will suck, and the working group will understand that, but it won't accept not being able to use v4.0 at all. The customer is going to be motivated to upgrade his clients for those that need high-performance access, but he may be OK with some clients using v4.0 for a long time, depending on the particular performance those clients need. (And some will want v2/v3 access but that is a matter that the working group has no say about). > Another reason, and I paraphrase someone else here, is that > it is possible to construct out-of-band metadata servers that do not > have access to the data servers except through the clients -- I > encourage the source of this scenario to replace my paraphrasing with a > correct use case, because I find it odd to design for file servers that > do not have access to the data servers. So let's grant that it is possible (and we'll pass over the issue of whether it is desirable, and in fact so desirable that one is willing to not get a standard and or get it much later). So we have a metadata server and it, for whatever reason, does not have access to the data servers. However, by hypothesis, there are machines (e.g. clients), that can communicate with both. 
So, if one has such an architecture, then one can take such a machine, give it a communication path to the meta-data server and the data server and have the meta-data server transfer v4.0 READ requests to it, let it read the data from the data server and send it back to the meta-data server who send it back to the original requestor. Is that a very good solution? No. Is it likely to be performant? No. Will it satisfy any particular customer? I don't know and that is the implementer's business decision. Will it satisfy the hypothetical customer who doesn't care about v4.0 access? Clearly. Will it satisfy the v4 working group? Yes, because they are not in the business of telling you how performant v4.0 access has got to be. > On the other hand, others have suggested that any access or work that a > client can do out-of-band should be possible with one or more commands > applied to the metadata server's data path. This has been proposed for > coping with recalled delegations, including concurrent writing by > multiple clients; retry after client access errors, provided adequate > idempotency of out-of-band operations; and many alternative > implementations of out-of-band clients, including legacy clients that > use out-of-band never or rarely. This effort is going to take a while, but if we manage it correctly, it is not going to take so long that v3 clients are going to be rare things, and they have to be supported. But v3 clients are not an issue for the working group. V4.0 clients are and they will be around and you will have to support them, and I believe the working group is not going to be disposed to cut you a lot of slack on this issue (and I don't see why it should). > I think this is a topic that should be argued one way or the other in > the requirements document. Use cases and examples in other systems > would be best. I think the requirement should be that this work should be done as a set of extensions to nfs-v4 delivered as a v4 minor version. 
If there is some feature/requirement that conflicts with that model (and it is a pretty flexible one), then you have to think long and hard before deciding that that requirement is more important than this basic delivery vehicle, because it seems to me that it is, in almost all respects, the ideal way to make this sort of technology available for widespread use.
From bhalevy@panasas.com Mon Dec 22 11:42:01 2003 Return-Path: X-Sender: bhalevy@panasas.com X-Apparently-To: pnfs-reqs@yahoogroups.com Received: (qmail 94145 invoked from network); 22 Dec 2003 19:41:59 -0000 Received: from unknown (66.218.66.166) by m6.grp.scd.yahoo.com with QMQP; 22 Dec 2003 19:41:59 -0000 Received: from unknown (HELO PIKES.panasas.com) (65.194.124.178) by mta5.grp.scd.yahoo.com with SMTP; 22 Dec 2003 19:41:59 -0000 Received: by PIKES.panasas.com with Internet Mail Service (5.5.2653.19) id ; Mon, 22 Dec 2003 14:41:57 -0500 Message-ID: <30489F1321F5C343ACF6872B2CF7942A05D38733@PIKES.panasas.com> To: "'julian_satran@il.ibm.com'" , "'pnfs-ops@yahoogroups.com'" Cc: "'pnfs-reqs@yahoogroups.com'" Date: Mon, 22 Dec 2003 14:41:53 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2653.19) Content-Type: text/plain; charset="iso-8859-1" X-eGroups-Remote-IP: 65.194.124.178 From: "Halevy, Benny" Subject: RE: [pnfs-ops] delegation arguments summary X-Yahoo-Group-Post: member; u=169276676 X-Yahoo-Profile: benny_halevy
> > * layout delegation revocation (and enforcement of) > > This issue is orthogonal. We discussed volatile file handles, OSD > > capabilities, and SAN LUN mapping techniques. > > > > Almost orthogonal.
There is a subtle problem of sharing layout delegations if one of the clients is doing writes or appends. This falls under CW (concurrent write) sharing since there is one or more writers. By saying "this issue is orthogonal" I meant that the mechanism for revoking the layout delegation is orthogonal to whether we need a complete new set of delegations or extend the current model. I agree that when the layout changes due to writes, appends, or for any other reason the server has to recall layout delegations, at least from those clients that requested a layout for the region that's about to be changed. Hopefully, all clients behave nicely and their delegations do not have to be revoked. You want to revoke the layout delegation from unresponsive clients since allowing them to use the stale layout may end up with data corruption. Speaking of append, I always thought it'd be really nice to have an NFS APPEND operation... This seems like something we can propose right away on nfsv4@ietf.org. How do people on this list feel about that? A use case I encountered is a customer that uses a shared file as a log and has multiple nodes in the cluster appending to that file with some coordination (right now, NFSv3 + NLM). They don't care about ordering of the appended records and they even accept records written more than once to the file, but they do care about the consistency of each record so writers can't just silently overwrite each other. > The issue is furthermore complicated by the "sparse" layout that we all want to support (do we?) Can you please turn the details knob on "sparse" layout and maybe give a concrete example where this layout makes the proposed model fall short? > > layout delegation: > > - returned on READ_IND, WRITE_IND, LAYOUT_DELEG_ASK > > > > Covers only layout (aggregation header, map, handles/caps). > > Optional, recallable, revocable. > > Assures the client that the layout information it has will not change.
> > But the layout information may change even in the most trivial single writer case and definitely in RW cases. Correct, when the layout is about to be changed (a writer calls COMMIT_IND) or when there is a write-write conflict (two clients call WRITE_IND for overlapping regions) some or all layout delegations must be recalled. > > WRITE yes client can safely cache read and write data, > > serve opens, and locks locally and can perform > > out-of-band or server reads and writes. > At least this requires mapping updates for block storage. > For those souls that want strict local-FS semantics (UNIX) cache and map invalidations can be a side-effect of the byte-range locking mechanism. This sounds like something that falls into the distributed cache coherency realm - meaning multiple clients have a CW data delegation and a layout delegation. My assumption was that in this case the logical block map changes rarely when the clients are writing in place, otherwise they should fall back to writing through the server. Having an efficient distributed cache coherency mechanism in NFS seems to me like a stretch but it's worth a discussion to see if block based SAN filesystems can or can't live without it. 
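A minimal sketch of the recall rule described in this message, offered only as an illustration: when a client asks to write a byte range, the server recalls layout delegations held by other clients whose ranges overlap it. The class and function names, the byte-range representation, and the choice to leave the writer's own delegation in place are all assumptions made for the example; none of them comes from the thread or from any NFS specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutDelegation:
    """Hypothetical record of one layout delegation held by a client."""
    client: str
    offset: int   # first byte covered
    length: int   # number of bytes covered

    def overlaps(self, offset: int, length: int) -> bool:
        # Half-open ranges [a, a+n) and [b, b+m) overlap when each
        # one starts before the other one ends.
        return (self.offset < offset + length
                and offset < self.offset + self.length)


def delegations_to_recall(held, writer, offset, length):
    """Return delegations of other clients that overlap the written range.

    Assumption: the writer keeps its own delegation; a real server
    might recall more ("some or all", as the message puts it).
    """
    return [d for d in held
            if d.client != writer and d.overlaps(offset, length)]


held = [
    LayoutDelegation("client-a", 0, 4096),
    LayoutDelegation("client-b", 4096, 4096),
    LayoutDelegation("client-c", 2048, 4096),
]

# client-a asks to write bytes [0, 4096).
to_recall = delegations_to_recall(held, "client-a", 0, 4096)
print([d.client for d in to_recall])  # ['client-c']
```

Because the ranges are half-open, client-b's delegation starting at byte 4096 does not conflict with a write ending at byte 4095, so only client-c is recalled. Revoking a delegation from a client that ignores the recall would be a separate step and is not modelled here.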
Benny From ggrider@lanl.gov Mon Dec 22 11:53:58 2003 Return-Path: X-Sender: ggrider@lanl.gov X-Apparently-To: pnfs-reqs@yahoogroups.com Received: (qmail 3731 invoked from network); 22 Dec 2003 19:53:57 -0000 Received: from unknown (66.218.66.166) by m11.grp.scd.yahoo.com with QMQP; 22 Dec 2003 19:53:57 -0000 Received: from unknown (HELO mailwasher-b.lanl.gov) (192.16.0.25) by mta5.grp.scd.yahoo.com with SMTP; 22 Dec 2003 19:53:57 -0000 Received: from mailrelay3.lanl.gov (localhost.localdomain [127.0.0.1]) by mailwasher-b.lanl.gov (8.12.10/8.12.10/(ccn-5)) with ESMTP id hBMJrufK001673; Mon, 22 Dec 2003 12:53:56 -0700 Received: from cic-mail.lanl.gov (localhost.localdomain [127.0.0.1]) by mailrelay3.lanl.gov (8.12.10/8.12.10/(ccn-5)) with ESMTP id hBMJrtIt031106; Mon, 22 Dec 2003 12:53:55 -0700 Received: from cthulu.lanl.gov (vpn-client-189.lanl.gov [128.165.253.189]) by cic-mail.lanl.gov (8.12.10/8.12.10/(ccn-5)) with ESMTP id hBMJrqFR016230; Mon, 22 Dec 2003 12:53:53 -0700 Message-Id: <5.2.0.9.2.20031222125146.018b3cc0@cic-mail.lanl.gov> X-Sender: ggrider@cic-mail.lanl.gov X-Mailer: QUALCOMM Windows Eudora Version 5.2.0.9 Date: Mon, 22 Dec 2003 12:53:51 -0700 To: pnfs-reqs@yahoogroups.com, "'julian_satran@il.ibm.com'" , "'pnfs-ops@yahoogroups.com'" Cc: "'pnfs-reqs@yahoogroups.com'" In-Reply-To: <30489F1321F5C343ACF6872B2CF7942A05D38733@PIKES.panasas.com > Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="=====================_15088946==.ALT" X-Scanned-By: MIMEDefang 2.35 X-eGroups-Remote-IP: 192.16.0.25 From: Gary Grider Subject: Re: [pnfs-reqs] RE: [pnfs-ops] delegation arguments summary X-Yahoo-Group-Post: member; u=169341185 X-Yahoo-Profile: ggriderpnfs At 02:41 PM 12/22/2003 -0500, Halevy, Benny wrote: > > > * layout delegation revocation (and enforcement of) > > > This issue is orthogonal. We dicussed volatile file handles, OSD > > > capabilities, and SAN LUN mapping techniques. > > > > > > > Almost orthogonal. 
There is a subtle problem of sharing layout delegations if one of clientts is doing writes or appends. > > This falls under CW (concurrent write) sharing since there is one or more writers. > By saying "this issue is orthogonal" I meant that the mechanism for revoking the > layout delegation is orthogonal to whether we need a complete new set of > delegations or extend the current model. > > I agree that when the layout changes due to writes, appends, or for any other > reason the server has to recall layout delegations, at least from those clients > that requested layout for region that's about to be the changed. Hopefully, > all clients behave nicely and their delegations do not have to be revoked. > You want to revoke the layout delegation from unresponsive clients since allowing > them to use the stale layout may end up with data corruption. > > Speaking of append, I always thought it'd be really nice to have an NFS APPEND > operation... This seems like something we can propose right away on nfsv4@ietf.org > How does people on this list feel about that? > > A use case I encountered is a customer that use a shared file as a log and have > multiple nodes in the cluster appending to that file with some coordination > (right now, NFSv3 + NLM). They don't care about ordering of the appended records > and they even accept records written more than once to the file, but they do care > about the consistency of each record so writers can't just silently overwrite > each other. > > > The issue is furthermore complicated by the "sparse" layout that we all want to support (do we?) > > Can you please turn the details knob on "sparse" layout and maybe give a > concrete example where this layout make the proposed model fall short? > > > > layout delegation: > > > - returned on READ_IND, WRITE_IND, LAYOUT_DELEG_ASK > > > > > > Covers only layout (aggregation header, map, handles/caps). > > > Optional, recallable, revocable. 
> > > Assures the client that the layout information it has will not change. > > > > But the layout information may change even in the most trivial single writer case and definitely in RW cases. > > Correct, when the layout is about to be changed (a writer calls COMMIT_IND) > or when there is a write-write conflict (two clients call WRITE_IND for > overlapping regions) some or all layout delegations must be recalled. > > > > WRITE yes client can safely cache read and write data, > > > serve opens, and locks locally and can perform > > > out-of-band or server reads and writes. > > At least this requires mapping updates for block storage. > > For those souls that want strict local-FS semantics (UNIX) cache and map invalidations