Contact: Anastassia Ailamaki

Database systems are among the most important application domains for storage systems. Currently, we are approaching database research on several fronts, with a common goal of identifying innovative ways to improve system performance by better matching database systems to the underlying hardware.

Staged Database Systems

Contact: Anastassia Ailamaki

Traditional database system architectures face a rapidly evolving operating environment, where millions of users store and access terabytes of data. In order to cope with increasing demands for performance high-end DBMS employ parallel processing techniques coupled with a plethora of sophisticated features. However, the widely adopted, work-centric, thread-parallel execution model entails several shortcomings that limit server performance when executing workloads with changing requirements. Moreover, the monolithic approach in DBMS software has led to complex and difficult to extend designs. This project introduces a staged design for high-performance, evolvable DBMS that is easy to tune and maintain. We propose to break the database system into modules and to encapsulate them into self-contained stages connected to each other through queues. The staged, data-centric design remedies the weaknesses of modern DBMS by providing solutions at both a hardware and a software engineering level.

The Staged Database System design: Each stage has its own queue and thread support. New queries queue up in the first stage, they are encapsulated into a "packet", and pass through the five stages shown on the top of the figure. A packet carries the query’s "backpack:" its state and private data. Inside the execution engine a query can issue multiple packets to increase parallelism.

Architecture-Conscious Database Systems: The PAX Storage Model

Contact: Anastassia Ailamaki

When optimizing cache utilization, data placement is extremely important. In commercial DBMSs, data misses in the cache hierarchy are a key memory bottleneck. The traditional data placement scheme used in DBMSs, the N-ary Storage Model (NSM), stores records contiguously starting from the beginning of each disk page, and uses an offset (slot) table at the end of the page to locate the beginning of each record and offers high intra-record locality. Recent research, however, indicates that cache utilization and performance is becoming increasingly important on modern platforms. This project first demonstrates that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. Therefore, we propose a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. When compared to NSM, our experimental results indicate that (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM's stall time due to data cache accesses, (b) range selection queries and updates on memory-resident relations execute 17-25% faster, and (c) TPC-H queries involving I/O execute 11-48% faster. We have also found that PAX performs well across different memory system designs.


PAX/NSM speedup for DSS queries: Shows PAX/NSM speedups when running range selections and four TPC-H queries against a 100, 200, and 500-MB TPC-H database on top of the Shore storage manager. Decision-support systems are especially processor- and memory-bound, and PAX outperforms NSM for all these experiments. The speedups obtained are not constant across the experiments due to a combination of differing amounts of I/O and interactions between the hardware and the algorithms being used.

Improving Database Performance through Prefetching

Contact: Todd Mowry, Anastassia Ailamaki

We are exploring the latency hiding approaches, cache prefetching and I/O prefetching, to improve database performance. We have used cache prefetching to improve main memory B+tree performance (SIGMOD'01), studied fractal prefetching B+trees to optimize both disk and cache performance with a single index structure (SIGMOD'02), and improved hash join performance with prefetching (submitted). We are now focusing on prediction schemes using history information to improve general join algorithms.

Fractal Prefetching B+trees: Optimizing Both Cache and Disk Performance

Fractal Prefetching B+-Trees (fpB+-Trees) are a type of B+-Tree that optimize both cache and I/O performance by embedding "cache-optimized" trees within "disk-optimized" trees. This improves CPU cache performance in traditional B+-Trees for indexing disk resident data and I/O performance in B+-Trees optimized for cache. At a coarse granularity an fpB+-Tree contains disk-optimized nodes that are roughly the size of a disk page; at a fine granularity, it contains cache-optimized nodes that are roughly the size of a cache line. The fpB+-Tree is referred to as "fractal" because of its self-similar "tree within a tree" structure.

Recently, researchers have proposed new types of B+-Trees optimized for CPU cache performance in main memory environments, where the tree node sizes are one or a few cache lines. Unfortunately, due primarily to this large discrepancy in optimal node sizes, existing disk-optimized B+-Trees suffer from poor cache performance while cache-optimized B+-Trees exhibit poor disk performance. We propose fractal prefetching B+-Trees (fpB+-Trees), which embed "cache-optimized" trees within "disk-optimized" trees, in order to optimize both cache and I/O performance. We design and evaluate two approaches to breaking disk pages into cache-optimized nodes: disk-first and cache-first. These approaches are somewhat biased in favor of maximizing disk and cache performance, respectively, as demonstrated by our results. Both implementations of fpB+-Trees achieve dramatically better cache performance than disk-optimized B+-Trees: a factor of 1.1-1.8 improvement for search, up to a factor of 4.2 improvement for range scans, and up to a 20-fold improvement for updates, all without significant degradation of I/O performance. In addition, fpB+-Trees accelerate I/O performance for range scans by using jump-pointer arrays to prefetch leaf pages, thereby achieving a speed-up of 2.5-5 on IBM's DB2 Universal Database.

Self-similar "tree within a tree" structure.

Active Disks

Today's commodity disk drives, the basic unit of storage for computer systems large and small, are actually small computers, with a processor, memory, and 'network' connection, along with the spinning magnetic material that permanently stores the data. As more and more of the information in the world becomes digitally available, and more and more of our daily activities are recorded and stored, people are increasingly finding value in analyzing, rather than simply storing and forgetting, these large masses of data. Sadly, advances in I/O performance have lagged the development of commodity processor and memory technology, putting pressure on systems to deliver data fast enough for these types of data-intensive analysis. Active Disks is a system that takes advantage of the processing power on individual disk drives to run application-level code. Moving portions of an application's processing directly to the disk drives can dramatically reduce data traffic and take advantage of the parallelism already present in large storage systems, providing a new point of leverage to overcome the I/O bottleneck.

For more information on Active Disks, please see the Active Disks project page.



Anastassia Ailamaki
Todd Mowry


Stavros Harizopoulos
Minglong Shao
Stratos Papadomanolakis
Shimin Chen



We thank the members and companies of the PDL Consortium: Alibaba Group, Amazon, Datrium, Facebook, Google, Hewlett Packard Enterprise, Hitachi Ltd., Intel Corporation, IBM, Micron, Microsoft Research, NetApp, Inc., Oracle Corporation, Salesforce, Samsung Semiconductor Inc., Seagate Technology, and Two Sigma for their interest, insights, feedback, and support.




© 2019. Legal Info.
Last updated 8 March, 2012