AutoPart: Automating Schema Design for Large Scientific Databases Using Data Partitioning
Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM). Santorini Island, Greece. June 21-23, 2004.
Stratos Papadomanolakis, Anastassia Ailamaki
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Database applications that use multi-terabyte datasets are becoming increasingly important for scientific fields such as astronomy and biology. Scientific databases are particularly suited for the application of automated physical design techniques, because of their data volume and the complexity of the scientific workloads. Current automated physical design tools focus on the selection of indexes and materialized views. In large-scale scientific databases, however, the data volume and the continuous insertion of new data allow for only limited indexes and materialized views. By contrast, data partitioning does not replicate data, thereby reducing space requirements and minimizing update overhead. In this paper we present AutoPart, an algorithm that automatically partitions database tables to optimize sequential access, assuming prior knowledge of a representative workload. The resulting schema is indexed using a fraction of the space required for indexing the original schema. To evaluate AutoPart we built an automated schema design tool that interfaces with commercial database systems. We experiment with AutoPart in the context of the Sloan Digital Sky Survey database, a real-world astronomical database, running on SQL Server 2000. Our experiments demonstrate the benefits of partitioning for large-scale systems: partitioning alone improves query execution performance by a factor of two on average. Combined with indexes, the new schema also outperforms the indexed original schema by 20% (for queries) and a factor of five (for updates), while using only half the original index space.