A New Design for High Performance Data Warehousing
Event Type: Seminar
Date: March 26, 2008
Time:
01:00PM
- 02:30PM
Venue:
CC-2-2545
Abstract:
Most businesses today (e.g.: Walmart) gather sales data together into
data warehouses optimized for query analysis with the intent of finding
what sales occur and why. A data warehouse is made up of a set of "Star
Schemas", each star schema consisting of a central "Fact" table with the
sales data, with joins by foreign key-primary key to a set of
"Dimension" tables, such as Date, Product, Store, Customer, etc. Most
star schema queries retrieve data from a fact table using WHERE clause
column restrictions in dimension tables. Clustering is more important
than ever with modern disk technology, as we explaine in the talk.
Relatively new database indexing capabilities, e.g.: DB2's
Multi-Dimensional Clustering (MDC) introduced in 2003, provide methods
to "dice" the fact table along a number of orthogonal "dimensions";
these "dimensions" however must however be columns in the fact table.
The diced cells cluster the fact rows on several of these "dimensions"
at once so that queries with range restrictions on several such columns
can access crucially localized data and provide much faster query
response. Unfortunately the columns of the dimension tables of the star
schema are not usually represented in the fact table, except in the
uncommon case where the foreign keys for a dimension provide a hierarchy
based on their order, as with the Date dimension.
In this paper, we take the approach of adjoining physical copies of a
few dimension columns to the fact table. We choose columns at a
reasonably high level of some hierarchy commonly restricted in queries,
e.g., date_year to fact_year or customer_nation to fact_nation, to
ensure that the diced cubes that result are large enough that sequential
access within the cubes will amortize the seek time between them, yet
small enough to effectively cluster query row retrieval. We find that
database products with no dicing capabilities can gain such capability
by adjoining these dimension columns to the fact table, sorting the fact
table rows in order by a concatenation of these columns, and then
indexing the adjoined columns. We provide benchmark measurements that
show successful use of this methodology on three commercial database
products.
Speaker:
Xuedong Chen
Speaker Bio:
Xuedong Chen graduated from Anhui University, China, in 1997 with B.S.
in Applied Physics. He started Ph.D. study at Computer Science
Department of UMass Boston in 2002. In the same year, he received his
Master's Degree of Information System from College of Engineering at
Northeastern University. Xuedong Chen's research has been focused on
Data Warehouse. He was involved in an open source project to build a
Column-Oriented DBMS for Data Warehouse. He and his advisors Patrick
O'Neil and Betty O'Neil designed a Star Schema Benchmark for Data
Warehouse. He is currently working on a new clustering method called
Adjoined Dimension Columns clustering specifically for Data Warehouse.
He works full-time as a Software Engineer at XOS Technology. And he is
continuing his research part-time at UMass Boston. For more information,
please visit Xuedong Chen's
website.