Location
MC 5479
Candidate
Xianglong Bai | Applied Mathematics, University of Waterloo
Title
Graph Neural Network-based Approximate Bayesian Computation for Agent-based Model Calibration of Bacterial Population Growth
Abstract
Approximate Bayesian Computation (ABC) has emerged as a powerful likelihood-free inference framework for model selection and parameter inference in complex biological systems where explicit likelihood functions are intractable or computationally prohibitive.
However, the effectiveness of ABC depends strongly on the choice of summary statistics and distance metrics used to compare simulated and observed data. When analyzing time-lapse observations of growing cell populations, summary statistics are often manually designed features informed by domain expertise. Designing such statistics is challenging because they must capture complex spatial, structural, and temporal characteristics of the biological system. Consequently, handcrafted summary statistics may omit relevant information or fail to generalize across datasets, potentially leading to inefficient inference or biased posterior estimates. This motivates the use of data-driven approaches, such as Graph Neural Networks (GNNs), which can automatically learn informative representations directly from graph-structured data.
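As a rough illustration of why the summary statistics and the distance matter, the following is a minimal rejection-ABC sketch in Python. It is not the ABC-SMC pipeline of the thesis: the toy growth model, the uniform prior, the two handcrafted statistics, the relative Euclidean distance, and the tolerance are all assumptions made for this example.

```python
import random
import statistics

def simulate(rate, n_steps, rng):
    """Toy stochastic growth model (hypothetical): population size over time."""
    size = 10.0
    trajectory = []
    for _ in range(n_steps):
        size *= 1.0 + rate + rng.gauss(0.0, 0.01)
        trajectory.append(size)
    return trajectory

def summary(trajectory):
    """Handcrafted summary statistics: final size and mean step-wise growth ratio."""
    ratios = [b / a for a, b in zip(trajectory, trajectory[1:])]
    return (trajectory[-1], statistics.mean(ratios))

def distance(s_sim, s_obs):
    """Euclidean distance between summary vectors, scaled by the observed values."""
    return sum(((a - b) / b) ** 2 for a, b in zip(s_sim, s_obs)) ** 0.5

def abc_rejection(observed, n_draws, epsilon, rng):
    """Keep prior draws whose simulated summaries fall within epsilon of the data."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        rate = rng.uniform(0.0, 0.2)  # assumed uniform prior on the growth rate
        s_sim = summary(simulate(rate, len(observed), rng))
        if distance(s_sim, s_obs) < epsilon:
            accepted.append(rate)
    return accepted

rng = random.Random(0)
observed = simulate(0.08, 20, rng)  # synthetic "observed" data, true rate 0.08
posterior_sample = abc_rejection(observed, 2000, 0.1, rng)
```

The accepted draws approximate the posterior only as well as summary() preserves information about the parameter, which is the limitation that learned representations are meant to address.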
To address these limitations, this thesis proposes and systematically investigates four novel strategies for integrating deep learning-based representations into the Sequential Monte Carlo ABC (ABC-SMC) framework, with a focus on GNNs and Long Short-Term Memory (LSTM) models. These architectures are chosen to capture the relational structure of cell populations and the temporal dynamics inherent in time-lapse data.
Using GNNs, we encode spatial interactions among cells through contact edges in graph representations of the biological system. The temporal dynamics of the evolving cell population are captured in two ways. In one approach, LSTM layers are incorporated to model dependencies across successive graph observations in time-lapse sequences. In the alternative approach, we represent temporal relationships directly within the graph structure through lineage edges in a knowledge graph, which explicitly encode parent–daughter relationships between cells over time.
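The two edge types described above can be sketched in a few lines of Python. This is only an illustration: the Cell record, the circular cell shape, and the overlap-based contact rule are assumptions made here, not the representation used in the thesis.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot
from typing import Optional

@dataclass(frozen=True)
class Cell:
    """Hypothetical cell record: one node per cell per frame."""
    cell_id: int
    frame: int
    x: float
    y: float
    radius: float
    parent_id: Optional[int] = None

def contact_edges(cells):
    """Spatial edges: cells in the same frame whose circles touch or overlap."""
    edges = []
    for a, b in combinations(cells, 2):
        touching = hypot(a.x - b.x, a.y - b.y) <= a.radius + b.radius
        if a.frame == b.frame and touching:
            edges.append((a.cell_id, b.cell_id, "contact"))
    return edges

def lineage_edges(cells):
    """Temporal edges: directed parent -> daughter links across frames."""
    known_ids = {c.cell_id for c in cells}
    return [(c.parent_id, c.cell_id, "lineage")
            for c in cells if c.parent_id in known_ids]

cells = [
    Cell(0, 0, 0.0, 0.0, 1.0),                 # mother cell in frame 0
    Cell(1, 1, -0.5, 0.0, 0.6, parent_id=0),   # daughter in frame 1
    Cell(2, 1, 0.5, 0.0, 0.6, parent_id=0),    # daughter in frame 1
    Cell(3, 1, 5.0, 0.0, 0.6),                 # unrelated, distant cell
]
graph_edges = contact_edges(cells) + lineage_edges(cells)
```

In this toy example the two daughters share a contact edge, each receives a lineage edge from the mother, and the distant cell stays isolated.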
We consider two learning paradigms for extracting informative representations from these graphs. In the first approach, graph-level regression models are trained using mean squared error (MSE) loss to directly predict model parameters from simulated data. In the second approach, graph embedding models are trained with a triplet loss to learn low-dimensional representations that preserve the similarity relationships among simulations generated from similar parameter configurations.
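To make the two training objectives concrete, here is a minimal sketch of both losses in plain Python. The Euclidean embedding distance and the margin value are assumptions for this example; the abstract does not specify which variants the thesis uses.

```python
def mse_loss(predicted, target):
    """Mean squared error between predicted and true parameter vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive closer than the negative by a margin.

    anchor and positive would be embeddings of simulations from similar
    parameters; negative would come from dissimilar parameters.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

regression_loss = mse_loss([1.0, 2.0], [1.0, 4.0])
well_separated = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
poorly_separated = triplet_loss((0.0, 0.0), (0.0, 1.0), (0.0, 1.5))
```

The regression loss yields parameter predictions that can serve directly as summary statistics, whereas the triplet loss only shapes distances in the embedding space, so the embedding itself plays the role of the summary.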
The resulting representations serve as GNN-based summary statistics, replacing conventional handcrafted statistics within the ABC-SMC inference pipeline. Such data-driven approaches belong to the broader class of GNN-based methods for likelihood-free inference, which aim to automatically extract informative features from complex simulation outputs.
We evaluate the proposed strategies against a baseline approach relying on classical summary statistics. Inference performance is assessed using two complementary metrics: the Kullback-Leibler (KL) divergence between the inferred posterior distributions and the ground-truth parameters, and the mean squared distance (MSD) between the inferred and true parameter values. Across all evaluated strategies, the GNN-based summary statistics consistently outperform conventional handcrafted statistics in simulation studies. They yield more accurate posterior approximations, as reflected by reduced KL divergence, and more precise parameter estimates, as reflected by lower MSD values. However, the results are less convincing on real data, likely due to model mismatch.
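One plausible reading of the MSD metric is the average squared Euclidean distance between posterior samples and the true parameter vector. The sketch below follows that reading; the exact definition used in the thesis may differ, and the function name is hypothetical.

```python
def mean_squared_distance(posterior_samples, true_params):
    """Average squared Euclidean distance from each posterior sample to the truth."""
    squared = [
        sum((s - t) ** 2 for s, t in zip(sample, true_params))
        for sample in posterior_samples
    ]
    return sum(squared) / len(squared)

# Two posterior samples, each one unit away from the true parameters (2.0, 2.0).
msd = mean_squared_distance([(1.0, 2.0), (3.0, 2.0)], (2.0, 2.0))
```

Under this definition a lower MSD means the accepted particles concentrate more tightly around the ground truth, which is only computable in simulation studies where the true parameters are known.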
Overall, this work demonstrates that replacing handcrafted summary statistics with GNN-based ones can substantially improve likelihood-free inference in complex biological systems. By integrating GNNs with the ABC-SMC framework, the proposed approach enables the automatic extraction of informative representations from graph-structured, time-evolving population data. The resulting methodology provides a principled strategy for model calibration, bridging computational simulations and experimental observations through data-driven parameter inference. Although the biological model considered in this study serves primarily as a proof-of-concept for developing the inference pipeline, the proposed framework is designed to be readily extended to more complex agent-based and other mechanistic models commonly encountered in systems biology.