Dissertation Defense

Query-Centric Storage Partitioning for Distributed Systems


07:00PM - 08:00PM, March 20, 2017




Ting Zhang


For a storage system to keep pace with increasing amounts of data, a natural solution is to deploy more servers to expand storage capacity and relieve server bottlenecks. Because of their large number, these servers need to be placed at geographically distributed locations, which incurs unavoidable communication costs. Consequently, an important design problem is how best to partition the data across the servers. To minimize cross-server traffic, the mainstream approach is data-centric: data with similar content are assigned to the same server. It is difficult, however, to quantify content similarity effectively when the content has many attributes or belongs to incomparable categories. In contrast, this dissertation advocates a query-centric storage approach in which the only input is the queries, and the data partitioner aims to assign data that are often queried together to the same server. This approach does not assume that a content similarity measure exists, and is thus applicable to both similarity search and non-similarity search. Under this approach, if all queries are given in advance, an optimal partitioner can be found by solving a classic hypergraph partitioning problem. The focus of this dissertation is the online setting: as queries arrive in a stream, how should the current partition be revised incrementally to obtain the best partition for future queries? The contributions are (1) a formal formulation of this unexplored problem as a multi-objective optimization problem, (2) an evolutionary algorithm framework to explore Pareto-optimal partitioning solutions, and (3) an investigation of greedy online algorithms. Two case studies are considered: query-centric partitioning of an online social network and query-centric partitioning of a general distributed network. The findings are substantiated with evaluations using real-world datasets.
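To make the query-centric view concrete, the following minimal Python sketch treats each query as a hyperedge over the data items it touches and measures cost as the number of extra servers a query must contact. It is an illustration only, not the algorithms of the dissertation: the function names, the capacity limit, and the simple greedy placement rule are all assumptions introduced here, and the sketch only places new items as queries arrive rather than revising an existing partition.

```python
from collections import Counter


def query_span(query, assignment):
    """Number of distinct servers a query must contact."""
    return len({assignment[item] for item in query})


def total_cost(queries, assignment):
    """Sum over queries of (servers contacted - 1).

    This is the connectivity objective commonly used in hypergraph
    partitioning, with each query acting as one hyperedge.
    """
    return sum(query_span(q, assignment) - 1 for q in queries)


def greedy_place(queries, num_servers, capacity):
    """Hypothetical greedy online placement (an assumption, for illustration).

    Each previously unseen item goes to the server that already holds the
    most items of the current query, among servers with spare capacity.
    Ties are broken by lower load, then by lower server index.
    """
    assignment = {}
    load = [0] * num_servers
    for query in queries:
        for item in query:
            if item in assignment:
                continue
            # Count where the already-placed items of this query live.
            votes = Counter(assignment[i] for i in query if i in assignment)
            candidates = [s for s in range(num_servers) if load[s] < capacity]
            if not candidates:
                raise ValueError("all servers are at capacity")
            best = max(candidates, key=lambda s: (votes[s], -load[s], -s))
            assignment[item] = best
            load[best] += 1
    return assignment


# A small query stream over five items and two servers.
queries = [("a", "b"), ("b", "c"), ("d", "e"), ("a", "c")]
placed = greedy_place(queries, num_servers=2, capacity=3)

# A query-oblivious baseline: items assigned round-robin.
round_robin = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0}

print(total_cost(queries, placed), total_cost(queries, round_robin))
```

On this toy stream the greedy placement keeps every query on a single server, while the round-robin baseline forces three of the four queries to span both servers. The capacity limit stands in for a load-balance constraint; without one, placing all data on a single server would trivially minimize cross-server traffic.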