Please note: This PhD defence will take place in DC 2314.
Zeynep Korkmaz, PhD candidate
David R. Cheriton School of Computer Science
Supervisors: Professors M. Tamer Özsu, Khuzaima Daudjee
Graph database management systems (GDBMSs) are increasingly used to support traversal-heavy workloads. Graph queries often follow data-dependent paths through the graph, producing irregular or non-sequential accesses that do not align with the storage hierarchy, from data layout on disk to main-memory data pages. When storage and caching mechanisms are agnostic to graph structure, they can fail to exploit opportunities for locality. This thesis studies how the structural properties of graphs can be leveraged to improve locality across the storage hierarchy in disk-based GDBMSs. Although graph workloads are often characterized as having little locality, real-world graphs commonly exhibit strong connectivity patterns and community structure. These structural characteristics can provide useful information about which vertices and edges are likely to be accessed together during query execution. This thesis argues that incorporating graph topology into storage and cache-layer design enables the leveraging of locality benefits.
The first part of the thesis focuses on main-memory page management. We study how graph topology affects page access behaviour during graph traversals and show that topology-aware data layouts increase the likelihood that related graph data are collocated within pages. Based on this observation, we develop a cache replacement algorithm that uses structural information to better anticipate future page accesses, improving cache hits for graph workloads.
The second part of the thesis addresses data layout and serialization on disk. We investigate how graph data can be serialized to better represent connectivity patterns. We propose a topology-aware serialization strategy that captures real-world graph structure to support graph topology-aware cache replacement policies. We evaluate this strategy against existing approaches and demonstrate its effect on accepted metrics for judging locality quality.
Finally, we address a significant problem observed while designing the storage and cache-layer optimizations: the commonly used serialization quality metrics do not always correlate with actual query performance. We systematically study why standard metrics fail to consistently predict query efficiency, and we explore the reasons behind this irregularity.
Overall, this thesis makes three contributions: (1) it demonstrates how structural properties of graphs can be exploited to guide topology-aware cache replacement decisions in disk-based GDBMSs, (2) it proposes and evaluates a topology-aware serialization strategy for improving locality on disk, and (3) it provides an analysis of serialization locality quality metrics, exposing their limitations in representing query performance.