HomeNews and EventsEvents Calendar → Seminar


A New Design for High Performance Data Warehousing

Event Type: Seminar

Date: March 26, 2008

Time: 01:00PM - 02:30PM

Venue: CC-2-2545

Abstract:

Most businesses today (e.g.: Walmart) gather sales data together into data warehouses optimized for query analysis with the intent of finding what sales occur and why. A data warehouse is made up of a set of "Star Schemas", each star schema consisting of a central "Fact" table with the sales data, with joins by foreign key-primary key to a set of "Dimension" tables, such as Date, Product, Store, Customer, etc. Most star schema queries retrieve data from a fact table using WHERE clause column restrictions in dimension tables. Clustering is more important than ever with modern disk technology, as we explaine in the talk. Relatively new database indexing capabilities, e.g.: DB2's Multi-Dimensional Clustering (MDC) introduced in 2003, provide methods to "dice" the fact table along a number of orthogonal "dimensions"; these "dimensions" however must however be columns in the fact table. The diced cells cluster the fact rows on several of these "dimensions" at once so that queries with range restrictions on several such columns can access crucially localized data and provide much faster query response. Unfortunately the columns of the dimension tables of the star schema are not usually represented in the fact table, except in the uncommon case where the foreign keys for a dimension provide a hierarchy based on their order, as with the Date dimension. In this paper, we take the approach of adjoining physical copies of a few dimension columns to the fact table. We choose columns at a reasonably high level of some hierarchy commonly restricted in queries, e.g., date_year to fact_year or customer_nation to fact_nation, to ensure that the diced cubes that result are large enough that sequential access within the cubes will amortize the seek time between them, yet small enough to effectively cluster query row retrieval. We find that database products with no dicing capabilities can gain such capability by adjoining these dimension columns to the fact table, sorting the fact table rows in order by a concatenation of these columns, and then indexing the adjoined columns. We provide benchmark measurements that show successful use of this methodology on three commercial database products.

Speaker: Xuedong Chen

Speaker Bio:

Xuedong Chen graduated from Anhui University, China, in 1997 with B.S. in Applied Physics. He started Ph.D. study at Computer Science Department of UMass Boston in 2002. In the same year, he received his Master's Degree of Information System from College of Engineering at Northeastern University. Xuedong Chen's research has been focused on Data Warehouse. He was involved in an open source project to build a Column-Oriented DBMS for Data Warehouse. He and his advisors Patrick O'Neil and Betty O'Neil designed a Star Schema Benchmark for Data Warehouse. He is currently working on a new clustering method called Adjoined Dimension Columns clustering specifically for Data Warehouse. He works full-time as a Software Engineer at XOS Technology. And he is continuing his research part-time at UMass Boston. For more information, please visit Xuedong Chen's website.