1998 DARPA Research Summary
In 1998, we developed a preliminary NASD-optimized parallel file system and demonstrated linear bandwidth increases with increasing client demand and storage resources. Implemented as a middleware library extending a POSIX file system interface and offering a rudimentary implementation of the DARPA-funded Scalable I/O Initiative's Low-Level Application Programming Interface (SIO LLAPI), this parallel file system extends the scalable bandwidth potential of our NASD striping middleware to parallel applications. In our demonstration, we employ a parallel data mining application (association rule discovery) on a 300 MB benchmark retail sales file using eight 255 MHz Digital UNIX Alpha workstations as parallel clients. The retail sales data is striped over eight NASD prototypes, each a 133 MHz Digital UNIX Alpha workstation, and all machines are interconnected by a 155 Mbps ATM switch. This configuration delivers 45 MB/s (over 95% of the bandwidth of the underlying hardware), scaling linearly at about 6 MB/s per drive-client pair. In comparison, a much more powerful 500 MHz NFS server with more than twice the total raw disk bandwidth and over 35 MB/s of network bandwidth cannot provide more than 23 MB/s to eight clients, even if each client has a separate copy of the data (avoiding contention).
We cooperated with Seagate on a proto-standards Object Oriented Device (OOD) requirements description. This interface justification and specification is being shared within the NSIC working group on network-attached storage and with interested disk drive customers. Already informally presented to the ANSI X3T10 (SCSI) standards community, the OOD interface is based on CMU's NASD interface, adding emphasis on shared device access in a cluster of servers and on storage self-management. Specifically, consideration is being given to the integration of multi-disk striped object management and associated locking into each OOD disk.
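As a rough illustration of the multi-disk striping discussed above, the following Python sketch maps a byte offset in a striped file to a drive and an offset within that drive's component object, using simple round-robin placement. It is only a sketch under stated assumptions: the names `StripeLocation` and `locate` and the 512 KB stripe unit are hypothetical and are not taken from the NASD middleware; only the drive count of eight comes from the demonstration described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripeLocation:
    drive: int         # index of the drive holding the byte
    drive_offset: int  # byte offset inside that drive's component object


def locate(file_offset: int, num_drives: int = 8,
           stripe_unit: int = 512 * 1024) -> StripeLocation:
    """Map a logical file offset to (drive, offset) under round-robin striping.

    The stripe unit size is an assumption made for illustration only.
    """
    if file_offset < 0:
        raise ValueError("file_offset must be non-negative")
    unit_index = file_offset // stripe_unit  # which stripe unit of the file
    drive = unit_index % num_drives          # units rotate across the drives
    row = unit_index // num_drives           # full stripes preceding this unit
    return StripeLocation(drive, row * stripe_unit + file_offset % stripe_unit)


# Consecutive stripe units land on consecutive drives, so a large sequential
# read can keep all eight drives busy at the same time.
first_units = [locate(i * 512 * 1024).drive for i in range(9)]
```

Because each client computes this mapping itself, no central server sits on the data path, which is consistent with the per-pair scaling reported above (45 MB/s over eight drive-client pairs is roughly 5.6 MB/s per pair).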
In CMU's NASD, these functions are split between client middleware and an infrequently consulted environment-specific server (Cheops).
The PDL also developed a shared striped storage simulator and scalable optimistic synchronization protocols for server clusters based on simple drive-supported conditional storage operations such as "write only if request exceeds stored timestamp". We compared these optimistic protocols, in which contention is detected and corrected after the fact instead of seeking sole access permission before each access, against traditional cacheable distributed locking (leases). Aggressive implementations of both are complex, and performance depends heavily on concurrent data sharing, per-client reuse, and false sharing (the appearance of contention when there is none because one lock is shared by different data regions). However, because optimistic synchronization is vulnerable to contention primarily through network message delivery time variance and clock skew, whereas distributed locking is vulnerable to contention for the full duration of the disk operation, optimistic synchronization performs significantly better when false sharing is common and real sharing is not (typical of existing applications and lock structures that minimize metadata size and maximize lock prefetching through colocation).
We developed and implemented a native object storage system for NASD devices. Replacing the use of a traditional UNIX filesystem for storing disk objects in our prototypes, this native object system provides a direct implementation of object descriptors supporting NASD attributes and offers dynamically resizable partitions, drive and partition control objects (reusing the file access abstraction for parameter manipulation and status reporting), and a NASD-specific object cache. Contiguous allocation is enhanced with attribute-based preallocation of space, and related objects are clustered according to a "locate near other object" attribute.
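Returning briefly to the optimistic synchronization protocols described above, the drive-supported conditional operation "write only if request exceeds stored timestamp" can be sketched in a few lines of Python. This is an illustrative model only: the `Drive` class, its method names, and the use of integer timestamps are assumptions made for the example and do not reproduce the PDL protocols or the NASD interface.

```python
from dataclasses import dataclass


@dataclass
class Block:
    data: bytes = b""
    timestamp: int = 0  # timestamp of the last accepted write


class Drive:
    """Toy drive that accepts a write only if it is newer than what is stored."""

    def __init__(self) -> None:
        self._blocks: dict[str, Block] = {}

    def conditional_write(self, block_id: str, data: bytes,
                          request_ts: int) -> bool:
        block = self._blocks.setdefault(block_id, Block())
        if request_ts <= block.timestamp:
            # Conflict detected after the fact: the caller must recover,
            # for example by re-reading and retrying with a newer timestamp.
            return False
        block.data = data
        block.timestamp = request_ts
        return True

    def read(self, block_id: str) -> tuple[bytes, int]:
        block = self._blocks.get(block_id, Block())
        return block.data, block.timestamp


drive = Drive()
accepted = drive.conditional_write("stripe-0", b"from client A", request_ts=2)
rejected = drive.conditional_write("stripe-0", b"from client B", request_ts=1)
```

No lock is acquired before either write; the drive itself rejects the stale request, so the window for conflict is limited to message delivery variance and clock skew rather than the full duration of a lock-protected disk operation.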
Caching is enhanced by prefetching physically adjacent regions on any disk access, which, when coupled with intra- and inter-object clustering, reduces disk accesses and aggregate disk positioning time.
We expanded the prototype implementation platforms, formerly only DEC (Compaq) Alphas running Digital UNIX, to include Intel Pentium PC architectures running version 2.1 of the freely available Linux operating system. Implemented on a small embedded computer with a 200 MHz Pentium and 100 Mbps Ethernet, this Linux-based NASD port offers full NASD functionality, including a public-domain implementation of OSF's DCE RPC communications protocols. Our port to Wind River's VxWorks embedded computer operating system running on DEC (Intel) StrongARM microcontrollers is underway. VxWorks is currently operational on our StrongARM-based Fast Ethernet controller card. We have also expanded the networks over which our NASD prototypes operate, formerly only OC3 ATM, to include Fast Ethernet (100 Mbps) and Myrinet (1.2 Gbps).
The PDL instrumented and measured our Alpha-based NASD prototype using DCE RPC over OC3 ATM (155 Mbps). Projected onto a 200 MHz microcontroller implementation of NASD with no hardware acceleration for copying, our prototype code spends 70%-95% of its work in the communications protocol stack (DCE/UDP/IP) and still achieves 60%-90% of the speed of the fastest operations (cache hits) in modern specialized-hardware disks. Although this argues strongly for a specialized-hardware communications implementation, it also indicates that NASD control logic is well within the power of modern microcontrollers.
Work on high-performance microprocessors embedded with drive-specific ASICs has been stopped because of the rapid acceptance of this architecture in the industry.
Siemens' TriCore technology, a 32-bit superscalar microprocessor with customer ASIC logic and up to 2 MB of on-chip DRAM with up to 800 MB/s microprocessor access, is now available and targeted at hard disk controllers. Moreover, Cirrus, Lucent, National Semiconductor and Intel have recently announced the intention to provide system-on-a-chip products based on the ARM/StrongARM microprocessor architecture for the hard disk controller marketplace. Performance ranges from 75 MHz with 3-way instruction issue to 200 MHz with single issue, all within a price range targeted at commodity storage products.
We have also developed and implemented an (Alpha) binary transformation tool that causes a program to exploit, for aggressive prefetching, the time it is stalled waiting for disk data. The tool uses a smart copy-on-write scheme that allows a speculative thread to pre-execute code that will run after the current disk access completes. Applications transformed by this tool generate correct answers and, when prefetching is disabled, are slowed down by only a few percent. More importantly, the execution time of our initial I/O-intensive real-world applications has been reduced by 25%-65% when speculation-guided prefetching is active, given a four-disk array and the TIP informed prefetching cache manager. As we have also demonstrated that the TIP cache manager can prefetch from network storage as effectively as from local storage, speculative execution will automatically transform serial-I/O, latency-limited local-disk applications into aggressively prefetching, bandwidth-limited NASD applications.
Finally, we evaluated disk-embedded acceleration for scan-based applications in database search, statistical data mining, and multimedia processing. Based on VLSI technology trends, NASD disks will soon have many more processing cycles than NASD needs.
Scan-based applications are often parallelizable, allowing large disk arrays to act as special parallel processors, and often output a small fraction of their input, allowing network and backplane bottlenecks to be effectively eliminated. As a demonstration of this potential use of application-code execution in NASDs, we coded a version of our association rule discovery parallel data mining application to run almost exclusively in the NASD prototypes and achieved the same 45 MB/s with eight NASD drives and no client computing resources.
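The reason such scans output only a small fraction of their input can be illustrated with a simplified counting pass of the kind used in association rule discovery. In the Python sketch below, each drive scans its own share of the sales transactions and returns only item-pair counts, which a host then merges. The function names, the restriction to item pairs, and the sample baskets are assumptions made for illustration; this is not the code that ran in the NASD prototypes.

```python
from collections import Counter
from itertools import combinations


def drive_scan(transactions):
    """Runs next to the data: count co-occurring item pairs in local baskets."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def merge(partial_counts, min_support):
    """Runs on the host: add up per-drive counts and keep frequent pairs."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return {pair: n for pair, n in total.items() if n >= min_support}


# Hypothetical data: two drives, each holding part of the sales file.
drive_data = [
    [["milk", "bread"], ["milk", "bread", "eggs"]],
    [["bread", "milk"], ["eggs"]],
]
frequent = merge((drive_scan(d) for d in drive_data), min_support=3)
```

Only the small `Counter` objects cross the network, never the transactions themselves, so the interconnect stops being the bottleneck as more drives are added.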