1999 DARPA Research Summary



    Released a preliminary version of the NASD prototype software to our industrial partners (Seagate, Quantum, Intel, IBM, HP, StorageTek, 3Com, Hitachi, LSI Logic, Clariion, Wind River, Siemens). A further debugged version of this code will be released to the DARPA community in summer 1999. The first release contains NASD drive object management code, simple user-level client interfaces with NFS-like semantics, a user-level aggregation library, and a simple NFS-like file manager. The NASD drive software contained in this release operates at user-level in Linux (v2.2) and at user-level and kernel-level in Digital UNIX (v3.2) and uses DCE RPC for communication. The security system in this code protects transfer and capability integrity using keyed message digests (HMAC-SHA1), but, in compliance with US government regulations, it does not contain the appropriate encryption and decryption for transfer privacy. This code runs on DEC/Compaq Alpha and Dell/Intel Pentium workstations and on DEC/Intel StrongArm embedded microcontrollers. At least three of our partners are experimenting with this code base, which will allow user-level client applications to access files striped across multiple NASD devices in parallel. Recently we have ported it to run on Quantum's JINI Technology Demonstration Disks (a 233 MHz StrongArm, 32 MB DRAM, 10/100bT ethernet, 8-10 GB ATA disk media). Quantum has arranged to donate 100 of these technology demonstration units for NASD testbed use.

    Developed and released on the web hardware design specifications (VHDL) for a Secure Hash Algorithm (SHA-1) message digest engine. NASD security depends heavily on efficient message digests to ensure the integrity of each capability and all data transfers. However, a software SHA-1 implementation achieves 160 Mbps with 100% of a 500 MHz Alpha (120 Mbps on a 300 MHz PII), which is both far less than the current speeds (200-300 Mbps) of disk media and far more CPU power than affordable first-generation NASD drives can provide. On our NASD prototype (133 MHz Alpha), software SHA-1 reduces peak bandwidth by a factor of 6 below the disk's raw capabilities. To combat these limitations, we designed a hardware SHA-1 implementation that occupies about 15,000 gates of silicon and is easily affordable in current generations of disk drive ASICs. Our FPGA realization of this hardware executes SHA-1 message digests at over 200 Mbps; an embedded ASIC implementation is estimated to run at about 600 Mbps. For operations that require physical disk access, these speeds are adequate to allow NASD cryptography without disk performance impact. Unfortunately, because disk drive interconnects operate at 1-2 Gbps, requests that manipulate only volatile state (i.e., hit in the cache) will suffer response time penalties with only 600 Mbps of message digest bandwidth. Simulations of NFS and AFS workload traces indicate that this will increase latency by about 20% for typical (small) cache-hit transfers.

    Hardware cryptography has special risks because cryptographic algorithms fail (in terms of lost trust) globally when weaknesses are demonstrated and because the storage industry ships tens of millions of low-margin, zero-maintenance disks each year. A distrusted cryptographic algorithm installed in hardware in millions of warehoused products could bankrupt a vendor. Accordingly, we have put special effort into making software cryptography performance acceptable because software allows electronic updates to quickly and inexpensively correct a failed cryptographic algorithm. We have developed a hierarchical variation on a keyed message digest that first computes an unkeyed message digest on data blocks and then applies a keyed message digest to the digests produced in the first pass. While this increases the amount of digest computation by the ratio of the digest size to the block size, it allows the first-pass digests to be cached on disk and not recomputed on each read access. For block-aligned requests, this essentially eliminates cryptography as a bottleneck. If unaligned accesses are common (which is not the case in existing file systems), we change the second-pass digest algorithm to an incremental hash (e.g., Bellare's AdHASH), which makes it computationally efficient (XOR) to manipulate digests on small blocks. These techniques do not remove the cryptographic bottleneck from NASD write operations. Fortunately, writes are considerably less common than reads and are usually tolerant of longer latencies.
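
    The two-pass digest described above can be sketched as follows. This is a minimal illustration, not the NASD implementation: the 8 KB block size, the function names, and the use of HMAC-SHA1 for the second pass are assumptions made for the example.

```python
import hashlib
import hmac
from typing import List

BLOCK_SIZE = 8192  # assumed block size in bytes; not taken from the source


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """First pass: unkeyed SHA-1 of each block. These can be stored with the data."""
    return [hashlib.sha1(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def keyed_digest(key: bytes, digests: List[bytes]) -> bytes:
    """Second pass: keyed digest (HMAC-SHA1 here) over the per-block digests."""
    return hmac.new(key, b"".join(digests), hashlib.sha1).digest()


def hierarchical_digest(key: bytes, data: bytes) -> bytes:
    """Both passes together, as needed when the data is first written."""
    return keyed_digest(key, block_digests(data))
```

    On a block-aligned read, only keyed_digest needs to run over the cached first-pass digests, so the keyed work covers 20 bytes per block rather than the whole block.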

    Developed and implemented secure multi-object capabilities to further increase parallel file system scalability, especially for workloads with many small accesses and namespace manipulations. In such workloads, common in current NFS-like file systems, the construction of per-object capabilities can increase file manager load by 30-200% relative to an (insecure) system that does not need these capabilities to be constructed. Increased file manager load causes queueing of client requests, increasing client latency for simple, metadata-intensive operations such as "list directory". Moreover, because NASD drives approve one object access per capability and one capability per RPC, "list directory with attributes" requires a potentially large number of interactions with the server, while most common file systems have some variation of a single "bulkstatus" RPC to get all this information. Our multi-object capability mechanism extends the Berkeley packet filter (BPF) mini-language and tailors it to digest specified field values in an object's metadata during capability generation. By encoding file manager information like "user id" in an object's attributes, a single user-id-specific capability can be used to access all objects created by that user. Our implementation of this metadata filter capability reduces file manager load to within 1-13% of the load when no capabilities are constructed and increases the effectiveness of a capability cache in a NASD drive from a 50% hit rate to a 95% hit rate, while increasing NASD CPU load (for executing the filters) by only 1-4%.
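
    The idea of one capability covering every object whose attributes match a filter can be illustrated with a small sketch. It is deliberately simplified and does not reproduce the BPF-based mechanism: the filter is a plain attribute dictionary, the capability layout is hypothetical, and all names (issue_capability, drive_allows) are invented for the example.

```python
import hashlib
import hmac
import json


def _encode(attr_filter, rights):
    # Canonical byte encoding of the capability's public fields (assumed format).
    return json.dumps({"filter": attr_filter, "rights": sorted(rights)},
                      sort_keys=True).encode()


def issue_capability(drive_key, attr_filter, rights):
    """File manager side: one capability for every object matching attr_filter."""
    tag = hmac.new(drive_key, _encode(attr_filter, rights), hashlib.sha1).hexdigest()
    return {"filter": dict(attr_filter), "rights": sorted(rights), "tag": tag}


def drive_allows(drive_key, capability, object_attrs, op):
    """Drive side: check the keyed digest, the requested right, then the filter."""
    expected = hmac.new(drive_key,
                        _encode(capability["filter"], capability["rights"]),
                        hashlib.sha1).hexdigest()
    if not hmac.compare_digest(expected, capability["tag"]):
        return False  # capability was forged or altered
    if op not in capability["rights"]:
        return False
    # The object must carry every attribute value named in the filter.
    return all(object_attrs.get(k) == v for k, v in capability["filter"].items())
```

    With a filter such as {"uid": 1001}, the file manager issues a single capability and the drive evaluates it against each object's attributes, instead of the file manager constructing one capability per object.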

    Developed lightweight RPC communication packages for NASD. Our prior systems use DCE RPC, which, when security is turned off, accounts for 70-95% of the work done by our prototype NASD drive and limits the speed of the fastest cache-hit operations to 60-90% of modern disk drive performance. To address this problem we first decompose RPC functions into two groups: 1) language binding, argument marshalling and thread management; and 2) reliable transport. The interface between these two groups of RPC functions is essentially Berkeley sockets, allowing TCP to be used for reliable transport. We have replaced DCE RPC non-transport functions in two different ways: at user-level we have ported and tuned the user-level networking libraries (XIO) of the DARPA-funded Exokernel, and in the kernel we have written a streamlined NASD-specific sockets-based RPC. The latter achieves 50% faster nil data transfers and 40% faster large data transfers on Fast Ethernet (within 15% of the bandwidth of raw TCP). We have also ported Myrinet's 1 Gbps network substrate and Intel's Virtual Interface Architecture (VIA) as alternative reliable transports. Though not yet fully tuned, Myrinet, VIA, XIO and NASD move 30 MB/s, well above peak disk rates.
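
    The split between marshalling and reliable transport can be illustrated with a short sketch of a length-prefixed message format layered on a stream socket. The header layout, opcode values, and function names are hypothetical, and a local socketpair stands in for a TCP connection so the example runs on its own.

```python
import socket
import struct

# Hypothetical header: payload length (4 bytes), opcode (2 bytes), status (2 bytes).
HEADER = struct.Struct("!IHH")


def send_msg(sock, opcode, status, payload=b""):
    """Marshal one message and hand it to the reliable byte stream."""
    sock.sendall(HEADER.pack(len(payload), opcode, status) + payload)


def _recv_exact(sock, n):
    """Read exactly n bytes; a stream socket may return fewer per recv call."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock):
    """Unmarshal one message: fixed header first, then the payload it announces."""
    length, opcode, status = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return opcode, status, _recv_exact(sock, length)


def demo_round_trip():
    """Send a request one way and a reply back over a local socket pair."""
    client, server = socket.socketpair()
    try:
        send_msg(client, 1, 0, b"object-42")
        opcode, _, payload = recv_msg(server)
        send_msg(server, opcode, 0, payload.upper())
        return recv_msg(client)
    finally:
        client.close()
        server.close()
```

    Everything above the socket calls is marshalling; swapping the transport would only change how the socket object is obtained.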

    Developed advanced NASD-embedded object layout and video playback subsystems to demonstrate new opportunities for optimization and specialization inherent in a NASD architecture. Our advanced object layout scheme was inspired by Ganger's CFFS clustering scheme; by using NASD's "locate near object" pointer to disclose the parent directory, our object system groups small objects sharing the same parent object (but does not otherwise try for adjacency or contiguity) into regions that are fetched as a unit, effectively prefetching the files of a directory on first use. This transparent layout optimization yields a 10% reduction in the overall execution time of a software development benchmark (check out, index, and compile the NASD source tree). Our video playback system stripes CBR video streams over multiple NASD drives. Playing a video amounts to informing each NASD of an agreed-upon start time, the video in question and each NASD's index in that video. Based on loosely synchronized clocks, each NASD independently decides when to send its slice of the video and then sends it. Designed around the RSVP, RTP, and RTCP internet protocols, the NASD-embedded video playback subsystem also employs drive-specific scheduling protocols to ensure and limit video playback to a specific fraction of the disk's resources. Best-effort traffic is interspersed according to CVSCAN disk scheduling provided that it does not adversely impact video scheduling. This enables a NASD array to service scalable high-bandwidth, scalable video playback, and scalable small-object workloads.
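
    The independent playback decision each drive makes can be sketched as a small schedule calculation. The round-robin striping order, the fixed slice duration, and the function name are assumptions made for illustration; the source states only that each drive knows the start time, the video, and its own index.

```python
def slice_schedule(start_time, drive_index, num_drives, num_slices, slice_seconds):
    """Return (slice_number, send_time) pairs for one drive.

    Assumes slice k of the video lives on drive k % num_drives and is due
    k * slice_seconds after the agreed start time. Each drive can compute
    this locally, with no coordination beyond the shared start time.
    """
    return [(k, start_time + k * slice_seconds)
            for k in range(drive_index, num_slices, num_drives)]
```

    For example, with three drives, eight slices, and two-second slices starting at time 100, drive 1 would send slices 1, 4, and 7 at times 102, 108, and 114.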

    Developed and implemented, in IRIX (v6.5) on a 4 processor SGI Origin 200, compiler techniques for prefetching and replacing virtual memory pages when available physical memory is far less than application requirements. Static analysis of memory reuse patterns predicts which recently accessed pages will not be needed for a long time and which pages not recently accessed will be needed soon. For the latter pages, prefetching exploits parallel disk access to increase overall I/O bandwidth and to hide disk access latency. For the former pages, early replacement (released pages) decreases the operating system's notion of the application's memory needs. Decreased memory usage decreases page faults in competing (interactive) tasks, because more memory is available to these applications. It also decreases the operating system's use of unnecessary page faults for page replacement selection (LRU approximation) algorithms because it decreases the need for this software to replace pages. Applied to four out-of-core scientific benchmarks, prefetching and releasing reduce I/O stall time by 50-100%, reduce overall execution time by 12-34%, and reduce the response time of a competing interactive task by a factor of 2.5-29.

    Explored and prototyped drive-embedded database operations to accelerate data-intensive queries in an open-source relational database system. Our results demonstrate that the basic database operations (select, project, and join) benefit from the bandwidth reduction and excess MIPS available across multiple NASD devices. The select and project/aggregate operations display linear scalability with increasing number of drives. A small prototype system with 10 disks shows a factor of 2.5 improvement and we expect to see a factor of 25 improvement using our larger 100 drive testbed. The join operation is more sensitive to the size of data sets and queries. The best case shows linear speedup with increasing numbers of disks while the worst case queries perform only slightly better than in a traditional database server, but offload a significant amount of server processing. Further, NASD's object model and the ability to use integrated drive scheduling allow background scanning applications to read data (for free) within the seeks of a foreground transaction workload. Measurements in simulation show that, even with a high foreground load, a background task can achieve 30% of the drive's peak sequential bandwidth without interfering with performance of the foreground task.
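
    The bandwidth reduction behind the select and aggregate results can be illustrated with a small sketch in which each drive filters its own partition and returns only a partial aggregate. The row format, function names, and in-process "drives" are hypothetical; this is not the prototype's code.

```python
def drive_scan(rows, predicate, column):
    """Runs on one drive: filter locally and return only a (count, sum) partial."""
    count = 0
    total = 0
    for row in rows:
        if predicate(row):
            count += 1
            total += row[column]
    return count, total


def host_average(partitions, predicate, column):
    """Runs on the host: merge the small per-drive partials into one average."""
    partials = [drive_scan(rows, predicate, column) for rows in partitions]
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return total / count if count else None
```

    Each drive ships two numbers to the host regardless of how many rows it scans, which is why adding drives adds scan capacity without adding load on the interconnect or the host.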



    © 2008.
    Last updated 11 November, 2004