Please note: This master’s thesis presentation will be given online.
Jeremy
Chen, Master’s
candidate
David
R.
Cheriton
School
of
Computer
Science
Join cardinality estimation is a fundamental problem that is solved in the query optimizers of database management systems when generating efficient query plans. This problem arises both in systems that manage relational data as well those that manage graph-structured data where systems need to estimate the cardinalities of subgraphs in their input graphs. We focus on graph-structured data in this thesis.
A popular class of join cardinality estimators uses statistics about sizes of small size queries to make estimates for larger queries. Statistics-based estimators can be broadly divided into two groups: (i) optimistic estimators that use statistics in formulas that make degree regularity and conditional independence assumptions; and (ii) the recent pessimistic estimators that estimate the sizes of queries using a set of upper bounds derived from linear programs, such as the AGM bound, or tighter bounds, such as the MOLP bound that are based on information theoretic bounds.
In this thesis, we introduce a new framework that we call cardinality estimation graph (CEG) that can represent the estimates of both optimistic and pessimistic estimators. We observe that there is generally more than one way to generate optimistic estimates for a query, and the choice has either been ad-hoc or unspecified in previous work. We empirically show that choosing the largest candidate yields much higher accuracy than pessimistic estimators across different datasets and query workloads, and it is an effective heuristic to combat underestimations, which optimistic estimators are known to suffer from.
To further improve the accuracy, we demonstrate how hash partitioning, an optimization technique designed to improve pessimistic estimators’ accuracy, can be applied to optimistic estimators, and we evaluate the effectiveness.
CEGs can also be used to obtain insights of pessimistic estimators. We show MOLP estimator is at least as tight as the pessimistic estimator and are identical on acyclic queries over binary relations, and the MOLP CEG offers an intuitive combinatorial proof that the MOLP bound is tighter than the DBPLP bound.