Please note: This PhD defence will take place online.
Amine Mhedhbi, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Semih Salihoğlu
Querying graph-structured relations or graph data for short, i.e., data with many-to-many relationships between entities, is ubiquitous and integral to a wide range of analytical applications, e.g., recommendations and fraud detection. These queries tend to be pattern finding queries that contain further relational operations such as filtering, projections, grouping, and aggregations. The join component of pattern finding queries act as a choke point for existing Database Management Systems (DBMSs). This is due to the joins being mostly many-to-many joins leading to an explosion in the size of intermediate results for which existing DBMSs do not optimize for. Consequently, users find that scalability is the most pressing challenge that current DBMSs have when querying graph data.
This thesis demonstrates that it is possible to make analytical DBMSs efficient on graph workloads by tackling the intermediate results size explosion problem. We adopt a set of novel query processing techniques, namely, worst-case optimal joins and factorized representations. This requires rethinking core components of a DBMS. We focus on query execution and optimization. Specifically, we propose: i) end-to-end worst-case optimal join adoption: we tackle join order optimization and adaptive join ordering when adopting worst-case optimal joins. We propose a novel plan space that mixes both worst-case optimal joins and traditional binary joins. We also propose a new cost metric for these plans, and a cardinality estimator; ii) factorized vector execution: we tackle the architectural challenges of adopting factorized representations in vectorized query executors. We propose a new vector representation for intermediate results passed between operators and describe necessary changes to these operators; and iii) caching and reuse of intermediate results: we enable the materialization of intermediate results to cache and reuse within a pipelined query executor. Pipelining and materialization are at odds and we describe novel dataflow plans that bridge both approaches.
Our three query processing approaches are integrated in GraphflowDB, a prototype analytical in-memory DBMS. This prototype system shows that analytical DBMSs need to adopt in order to efficiently query graph-structured relations.