Master's Research Papers

2025

Investigating methods for handling covariates subject to limits of detection in linear and non-linear models

Author: Xiao, Y.
Supervisor: McGee, G.

In many epidemiological studies, one or more exposures of interest have missing observations because the measurement assay fails to detect values below its limit, leading to biased estimates of the relationship between exposures and the outcome. Although various advanced approaches for dealing with exposures subject to a limit of detection (LOD) have been proposed, their performance has not been compared for the analysis of multiple linear models and additive models. In this research paper, we investigate the results of two regression models after applying each of six popular approaches: deleting subjects with any censored exposures (complete-case analysis, CCA), substitution with the average of observed values, the missing data indicator (MDI) method with and without interactions, linear multiple imputation (MI), and non-linear MI. We conducted extensive simulation studies with continuous outcomes that are fully observed and continuous covariates that are left-censored to compare the bias, variance, and coverage of 95% confidence intervals of estimates after fitting a linear regression model and a generalized additive model (GAM), respectively. Simulation results with varied correlations between covariates and varied percentages of left-censoring are presented to provide insight into the selection of tools for handling left-censoring in different regression-analysis scenarios. CCA was inefficient and could result in large bias in some scenarios. Substitution by the average of observed values provided valid inference for the slope estimates only in the linear setting with independent covariates. The bias introduced by MDI increases with the correlation between covariates but can be corrected by adding interaction terms. Linear MI showed consistently negligible bias and stable estimates for both the mean response and the regression coefficients, with confidence interval coverage close to 95% in the linear setting, but introduced systematic bias and unstable estimates in the non-linear setting, whereas non-linear MI improved the accuracy and efficiency of the estimates, with coverage close to 95%. In conclusion, our study recommends linear MI when dealing with left-censored covariates in multiple linear regression analysis; non-linear MI and the MDI method have competitive performance when an additive model is used to analyze non-linear associations, with the choice between the two depending on the scenario.
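
To make the setup concrete, the sketch below (illustrative only, not the paper's simulation code; the coefficients, correlation, and 30% censoring level are assumptions) simulates a left-censored covariate and applies two of the simpler approaches, complete-case analysis and mean substitution, in a linear model.

    # Illustrative sketch (not the paper's code): simulate a covariate subject to a
    # limit of detection (LOD) and compare complete-case analysis (CCA) with
    # substitution by the mean of observed values in a linear model.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 5000
    x1 = rng.normal(size=n)                        # exposure subject to an LOD
    x2 = 0.5 * x1 + rng.normal(scale=0.9, size=n)  # correlated covariate
    y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

    lod = np.quantile(x1, 0.30)                    # 30% left-censoring (assumed)
    observed = x1 >= lod

    def ols(X, y):
        X = np.column_stack([np.ones(len(y)), X])
        return np.linalg.lstsq(X, y, rcond=None)[0]

    # complete-case analysis: drop subjects whose x1 falls below the LOD
    beta_cca = ols(np.column_stack([x1[observed], x2[observed]]), y[observed])

    # substitution: replace censored x1 values with the mean of the observed values
    x1_sub = np.where(observed, x1, x1[observed].mean())
    beta_sub = ols(np.column_stack([x1_sub, x2]), y)

    print("CCA estimates          :", np.round(beta_cca, 3))
    print("Mean-substitution est. :", np.round(beta_sub, 3))

Multiple imputation approaches would instead draw the censored values from a model for the covariate given the other variables, which is the idea behind the linear and non-linear MI methods compared in the paper.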


Balancing Risk and Return in Retirement Products: An Extended Analysis of Tontines with Minimum Guarantees

Author: Xia, X.
Supervisor: Li, B.

The purpose of this research paper is to reproduce the key findings from Options on Tontines: An Innovative Way of Combining Tontines and Annuities by An Chen and Manuel Rach, exploring the innovative combination of tontines and annuities. In recent times, increased life expectancy, low interest rates, and strict solvency regulations have revived interest in tontines. Unlike annuities, which shift longevity risk predominantly to insurers, tontines require policyholders to bear most of this risk. This study extends the original work by incorporating new analyses from both policyholders’ and insurers’ perspectives, focusing on the variance of the discounted payoffs of tontines with a minimum guarantee and on the insurer’s expected profits, respectively. These additions provide deeper insight into the financial stability and attractiveness of these combined financial instruments. By varying the guaranteed payments, the new product can cater to policyholders with different risk aversion levels and liquidity needs, thereby offering better risk-sharing between policyholders and insurers than either annuities or tontines alone.


Causal Inference for Non-time Series Distributed Lag Models

Author: Wong, A.
Supervisor: McGee, G.

This paper investigates key methodological choices and challenges when estimating the causal effects of time-varying pollution exposures on health outcomes through the use of Distributed Lag Models (DLMs). Three key choices are explored: how to define and represent the estimands of interest; how to address multicollinearity between exposure lags, a common issue that can distort effect estimates and mislead interpretations of windows of susceptibility; and how to select an appropriate causal inference method, such as standardization or marginal structural models (MSMs). Two simulations explore the implications of using marginal structural models to estimate causal effects. The first simulation demonstrates that using separate marginal structural models to estimate marginal causal effects at each lag can be effective at identifying windows of susceptibility. The second simulation explores outcome model misspecification, demonstrating that using a marginal structural model that models the exposure lags together to estimate the cumulative marginal effect for DLMs can provide value. The findings highlight the importance of future research into causal effect estimation for time-varying exposures.
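
For context, a generic distributed lag model for a continuous outcome can be written (standard notation; the paper's exact specification may differ) as

    E[Y_i \mid X_{i,0}, \ldots, X_{i,L}] = \beta_0 + \sum_{\ell=0}^{L} \theta_\ell X_{i,\ell},

where the lag-specific coefficients \theta_\ell describe windows of susceptibility and their sum \sum_{\ell} \theta_\ell gives the cumulative effect of the exposure history.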
 


Modeling Non-stationary Temperature Extremes: A Case Study of Paris and Detroit

Author: Zhang, Z.
Supervisor: Yang, F.

A process-informed framework is applied to analyze non-stationary temperature extremes, incorporating recent advances in Extreme Value Theory (EVT) and Bayesian inference. The Generalized Extreme Value (GEV) distribution is evaluated under both stationary and non-stationary formulations, with the Process-informed Nonstationary Extreme Value Analysis (ProNEVA) toolbox employed to accommodate flexible parameter–covariate relationships and quantify uncertainty via Markov Chain Monte Carlo methods. Using gridded temperature records from 1901 to 2024, comparative case studies are conducted for Paris and Detroit. In Paris, where a statistically significant upward trend in annual maximum temperatures is identified, a non-stationary GEV model with a quadratic trend in the location parameter offers superior fit and predictive performance. In contrast, Detroit shows no significant trend, and the stationary model demonstrates better parsimony and adequacy. These findings underscore the need for evidence-based model selection and demonstrate the utility of the ProNEVA framework in extreme value analysis under both stationary and non-stationary climatic conditions.
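
For reference, the GEV distribution function underlying these analyses is (standard form; the non-stationary location is written as a quadratic in time to mirror the Paris model described above, and the paper's exact parameterization may differ)

    F(x; \mu, \sigma, \xi) = \exp\!\left\{ -\left[ 1 + \xi \frac{x - \mu}{\sigma} \right]^{-1/\xi} \right\}, \quad 1 + \xi \frac{x - \mu}{\sigma} > 0, \qquad \mu(t) = \mu_0 + \mu_1 t + \mu_2 t^2.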


Mass Imputation Approach to Non-Probability Survey Samples

Author: Jin, J.
Supervisor: Wu, C.

In recent years, non-probability survey samples --- such as online opt-in panels and web-based surveys --- have become increasingly popular due to their convenience and cost-efficiency. However, their use poses significant challenges for statistical inference, as the underlying sampling mechanism is often unknown and potentially biased. This paper explores a mass imputation (MI) framework to facilitate valid inference for population parameters using non-probability survey samples. The MI approach leverages auxiliary covariates shared between a non-probability sample (with outcomes observed) and a reference probability sample (with outcomes missing) to impute missing values and estimate target population quantities. We present theoretical justifications for the method under standard assumptions, describe practical imputation strategies such as regression and k-nearest neighbor methods, and propose a bootstrap variance estimation procedure. Simulation studies demonstrate that deterministic regression achieves the lowest MSE for mean estimation, while randomized regression performs best for quantile estimation across all settings, especially under high predictive strength.
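
As an illustration of the estimator class discussed above (standard mass imputation notation, not necessarily the paper's), with an outcome model \hat{m} fitted on the non-probability sample S_A and applied to the reference probability sample S_B with design weights d_i^B, the population mean is estimated by

    \hat{\mu}_{MI} = \frac{1}{\hat{N}} \sum_{i \in S_B} d_i^B\, \hat{m}(x_i), \qquad \hat{N} = \sum_{i \in S_B} d_i^B,

where the regression and k-nearest neighbor strategies differ only in how \hat{m} is constructed.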


Exploring Statistical Arbitrage in Leveraged ETFs

Author: Wu, L.
Supervisor: Weng, C.

I identify structural discrepancies in representative LETF-ETF spreads stemming from volatility drag, convexity, financing costs, and various market frictions. The LETF-ETF daily spreads exhibit heavy tails and serial correlation, which can be modelled using an ARMA-GARCH framework. The risk-free interest rate substantially contributes to LETF underperformance. Building on these insights, I develop two arbitrage strategies: a predictive ARMA-GARCH model and a Hold-Rebalance approach, both of which demonstrate favorable risk-adjusted returns, especially for the fixed income ETF-LETF pair. Additionally, I explore statistical arbitrage between conceptually related but structurally different ETF-LETF pairs with a two-step regression-based framework. Back-testing results show that there exists persistent short-term mispricing in the pair spreads. This paper contributes to understanding LETF pricing inefficiencies and offers actionable strategies for arbitrage.
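
For context, a standard continuous-time approximation from the LETF literature (not the paper's exact model) for a fund with leverage ratio \beta, expense/financing rate f, risk-free rate r, and realized variance \sigma^2 over horizon T is

    \frac{L_T}{L_0} \approx \left( \frac{S_T}{S_0} \right)^{\beta} \exp\!\left\{ (1-\beta)\, r\, T - f\, T - \frac{\beta(\beta-1)}{2}\, \sigma^2 T \right\},

which makes explicit the volatility drag and the risk-free financing term behind the structural spreads studied here.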


Natural Mechanisms behind Gompertz Law: A Comparative Analysis of Reliability Theory and Vitality Modeling

Author: Li, G.
Supervisor: Zhou, K.

This report investigates how Gompertzian mortality patterns can arise from two fundamentally distinct modeling paradigms: subsystem-based failure accumulation from reliability theory and vitality-based stochastic depletion. Building on these frameworks, we examine two representative models, the Multiple and Interdependent Component Cause (MICC) model and the Jump-Diffusion Vitality (JDV) model, both of which offer a physiologically mechanistic explanation for aging and mortality. We provide a theoretical formulation of both models and conduct parallel simulations using harmonized calibration against Canadian mortality data. In both cases, modifications to the original model specifications are introduced to improve empirical fit and biological realism. Simulation results show that while both models are capable of explaining Gompertz Law under certain conditions, the MICC model demonstrates superior performance in capturing the exponential acceleration of the mid-to-late-life hazard and late-life mortality deceleration. In contrast, the JDV model offers greater flexibility in modeling early-life heterogeneity and random shocks. Together, the findings suggest that reliability and vitality modeling approaches provide complementary perspectives for understanding the dynamics of human mortality.
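
For reference, the Gompertz law referred to above specifies an exponentially increasing force of mortality,

    \mu(x) = A e^{Bx}, \qquad S(x) = \exp\!\left\{ -\frac{A}{B}\left( e^{Bx} - 1 \right) \right\}, \quad A, B > 0,

so that the log-hazard is linear in age; the two models are assessed on how naturally this pattern, and its late-life deceleration, emerge from their dynamics.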


Inverse Probability Weighting Approach to Non-Probability Survey Samples

Author: Ding, Y.
Supervisor: Wu, C.

Non-probability survey samples have become increasingly common due to their cost-efficiency and accessibility, but they present fundamental challenges to valid statistical inference because of the lack of a known sampling design. To address this, researchers often leverage auxiliary information from a reference probability sample to estimate population-level quantities. This paper investigates three inverse probability weighting methods under the assumption of ignorable participation: pseudo maximum likelihood (CLW), adjusted logistic propensity (WVL), and a calibration-based method (CAL). We further extend this framework to address non-ignorable participation by proposing a new calibration-based estimation procedure that adjusts for selection bias arising from unobserved outcomes. Simulation studies evaluate the robustness and convergence of these methods under varying model conditions. Our findings underscore the importance of correct model specification and auxiliary variable informativeness, while also highlighting the practical challenges associated with non-ignorable selection. The proposed method contributes to the growing framework for valid inference from non-probability samples under more realistic assumptions.
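
As a point of reference (generic notation, not the paper's exact estimators), the inverse probability weighting estimators considered here take the form

    \hat{\mu}_{IPW} = \frac{1}{\hat{N}} \sum_{i \in S_A} \frac{y_i}{\hat{\pi}_i}, \qquad \hat{N} = \sum_{i \in S_A} \frac{1}{\hat{\pi}_i},

where S_A is the non-probability sample and \hat{\pi}_i is the estimated participation probability obtained with the help of the reference probability sample; the CLW, WVL, and CAL methods differ in how \hat{\pi}_i is estimated.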


Imputation-based Approaches to Causal Inference

Author: Lin, Z.
Supervisor: Wu, C.

Causal inference aims to understand the underlying mechanisms that govern changes in outcomes due to exposure to external causes. The goal of this paper is to explore the use of purely imputation-based methods to handle missing-by-design data, construct point estimators for the average treatment effect, and compare their performance with existing methods. We reviewed the basic settings of causal inference in the potential outcomes framework. Common approaches for estimating the average treatment effect, including propensity score matching, inverse probability weighting, and the doubly robust estimators, are discussed. The focus is on regression imputation and nearest neighbor imputation, and we investigate their finite-sample performance in the continuous outcome setting via a simulation study. We find that regression imputation performs similarly to the inverse probability weighted estimators and the doubly robust estimators when using correctly specified models, whereas nearest neighbor imputation performs well only when the treatment and control groups are approximately balanced in size. A discussion of missing data mechanisms is also included.
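
For concreteness, under the potential outcomes framework the target parameter and the regression imputation estimator studied here can be written (standard notation) as

    \tau = E\{Y(1) - Y(0)\}, \qquad \hat{\tau}_{reg} = \frac{1}{n} \sum_{i=1}^{n} \left\{ \hat{m}_1(x_i) - \hat{m}_0(x_i) \right\},

where \hat{m}_1 and \hat{m}_0 are outcome regressions fitted in the treated and control groups, respectively, and nearest neighbor imputation replaces these fits with matched donors.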


Index Insurance Design within a Conditional Value-at-Risk Framework

Author: Cheng, Z.
Supervisor: Weng, C.

Agricultural production is highly vulnerable to external shocks, making risk management mechanisms essential for sustainability. Index insurance is a promising alternative to traditional loss-indemnifying insurance, but its effectiveness is constrained by overly simplistic payout structures. This paper introduces a novel framework for index insurance design based on Conditional Value-at-Risk (CVaR) minimization. We derive a general optimal indemnity function that allows for flexible relationships between loss and index variables. The proposed approach only requires conditional quantile estimation and avoids the need to estimate the joint distribution of losses and indices, enhancing its practical applicability. We implement the theoretical results in a numerical example, designing weather index insurance to hedge against corn production risk. Results show that the proposed optimal insurance consistently outperforms linear-type products in risk mitigation. Interpretability tools are used to increase transparency and stakeholder trust. This study provides a flexible, risk-sensitive, and practically implementable methodology for modern index insurance design.
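
For reference, the CVaR objective minimized in the design problem admits the usual Rockafellar-Uryasev representation (standard form; the design constraints of the paper are not shown),

    \mathrm{CVaR}_{\alpha}(X) = \min_{t \in \mathbb{R}} \left\{ t + \frac{1}{1-\alpha}\, E\big[ (X - t)_{+} \big] \right\},

where X denotes the relevant retained loss and \alpha is the confidence level.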


Estimating Causal Additive Interaction using Linear Odds-Based Models

Author: Cao, Y.
Supervisor: Zhu, Y.

Causal additive interaction is a measure of synergistic or antagonistic effects between two or more exposures, with wide applications in epidemiological research, such as studies of multidrug-resistant tuberculosis. A common metric to quantify such interaction is the relative excess risk due to interaction (RERI), which measures the extent to which the joint effect of two exposures exceeds the sum of their individual effects. While the RERI is typically computed from logistic regression via a non-linear transformation of the fitted regression coefficients, it becomes problematic when the outcome is rare, as the estimated coefficients are unstable with high variance. This paper revisits the linear odds model and extends it to the marginal structural linear odds model (MSLOM). In addition, we clarify the distinction between marginal and conditional RERI, which are frequently conflated in practice. We also show that the rare outcome assumption is crucial for consistent estimation of the RERI, as it makes the two RERIs comparable. Through simulation studies, this paper evaluates the performance of each model under various causal structures and confounding effects. The results show that the MSLOM consistently yields smaller bias when the outcome is rare, with this advantage becoming more pronounced as confounding effects increase.
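
For reference, the RERI for two binary exposures is defined (standard notation, not specific to this paper) as

    \mathrm{RERI} = RR_{11} - RR_{10} - RR_{01} + 1,

where RR_{ab} is the risk ratio comparing exposure pattern (A = a, B = b) with the doubly unexposed group; RERI > 0 indicates super-additive (synergistic) interaction.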


Optimal Insurance Design under Fairness Constraints based on Machine Learning

Author: Zhang, P.
Supervisor: Weng, C.

This research proposes a new machine learning approach to numerically solve insurance design problems under fairness constraints. Focusing on profile-invariant contracts with a zero-correlation requirement, we approximate the optimal indemnity function using a trainable piecewise linear model. The method is applied to simulated data generated under predefined distributions, and the trained indemnity functions consistently exhibit a Z-shaped structure. This pattern reflects that, in mean-variance optimization under fairness constraints, extreme losses are retained by policyholders in order to satisfy the zero-correlation fairness requirement. The results reveal the potential of machine learning as a robust and flexible framework for solving constrained optimization problems in actuarial science.


Data Combination with Surrogate Outcomes for Long Term Treatment Effect Estimation

Author: Chen, M.
Supervisor: Stevens, N.

Online Controlled Experiments (OCEs), colloquially referred to as A/B tests, are widely used to evaluate the effect of changes made to internet-based products, services, and initiatives. Such experiments typically run for a period of about two weeks, during which metrics of interest are calculated to evaluate the impact of the treatment. While the duration of these experiments is short, practitioners are typically interested in understanding the long-term impacts of treatments. A promising recent area of research uses surrogates for the primary outcome to estimate the long-term treatment effect. A wide variety of domain-specific methods have been proposed to estimate long-term treatment effects, but many solutions outside of the literature on surrogates may not generalize to experiments in other domains. We evaluate four approaches that leverage surrogate outcomes, verify the efficacy of all four when the surrogacy assumption is satisfied, and illustrate that the doubly robust and efficient estimators perform well when the number of available surrogates is limited. These results underscore the importance of relaxing potentially strict assumptions for long-term treatment effect estimation.
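
One common formalization of the surrogacy assumption referenced above (following the surrogate-index literature; not necessarily the exact condition used in the paper) is that, given pre-treatment covariates X and surrogates S, the long-term outcome Y carries no further information about the treatment W:

    Y \perp\!\!\!\perp W \mid S, X.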


Statistical Analysis of Sparse Data Bias

Author: Ma, K.
Supervisor: Wen, L.

Estimating average causal effects when outcomes are very rare (such as studies of rare adverse events or low-incidence diseases) can be fraught with regression failures and misleading confidence intervals. In this research paper, we explore two practical solutions within the targeted maximum likelihood estimation (TMLE) framework. The first, called rare-outcome TMLE (rTMLE), enables the incorporation of information about the outcome process. The second, which we call TMLE_F, borrows Firth's Jeffreys prior correction to regularize the logistic regressions that underlie the nuisance function estimates of TMLE. After laying out the theory behind outcome bounding and penalized likelihood, we conduct a Monte Carlo study across different event rates (ranging from moderately rare to extremely rare), sample sizes, and various scenarios of model misspecification. We compare these TMLE methods with widely used estimators, such as inverse probability weighting and propensity score matching estimators. Our results show that both rTMLE and TMLE_F consistently deliver near-nominal 95% coverage in all settings where traditional methods fall short.
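
For reference, the Firth/Jeffreys-prior correction used in TMLE_F replaces the log-likelihood of each logistic nuisance regression by the penalized version

    \ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2} \log \left| I(\beta) \right|,

where I(\beta) is the Fisher information; the penalty shrinks estimates away from the boundary and removes the first-order small-sample bias that sparse outcomes induce.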


Bayesian Hierarchical Modelling of Wastewater Data: Assessing Spatial Structure in SARS-CoV-2 Viral Load Across Ontario, Canada

Author: Oh, R.
Supervisor: Dean, C.

Wastewater surveillance is an effective tool for tracking the spread of SARS-CoV-2 across large populations. In this paper, we analyze wastewater viral load data collected across 33 Public Health Units (PHUs) in Ontario, Canada, over a 23-month period, incorporating the potential presence of spatial correlation in SARS-CoV-2 viral loads. We employ a Bayesian hierarchical model that decomposes PHU-level random effects into spatially structured and unstructured components using an iCAR prior and independent error terms. Our analysis provided evidence of only minor spatial autocorrelation, suggesting that unstructured random variation dominates. Simulation studies further examine the properties of the model, specifically its behaviour under different prior specifications and varying levels of spatial dependence. Results show that the model provided reliable parameter estimates under moderate spatial autocorrelation, though there can be sensitivity to the choice of prior distribution for the parameter linked to the proportion of spatial variation. These findings support the use of hierarchical models for wastewater epidemiology while accounting for the importance of spatial factors in inference.
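
Schematically, the PHU-level decomposition described above follows the usual convolution form (generic notation; the paper's hyperparameters and priors may differ),

    b_i = u_i + v_i, \qquad v_i \sim N(0, \sigma_v^2), \qquad u_i \mid u_{-i} \sim N\!\left( \frac{1}{n_i} \sum_{j \sim i} u_j, \; \frac{\sigma_u^2}{n_i} \right),

where j ~ i indexes the neighbours of PHU i and n_i is their number; the balance between \sigma_u^2 and \sigma_v^2 governs the proportion of spatial variation discussed in the results.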


Evaluating Bias in Joint Models For Recommended Visit Intervals Under Informative Observations

Author: Leclair, J.
Supervisor: McGee, G.

The use of Electronic Health Records (EHR) data in longitudinal analysis has become increasingly popular. The nature of this data means that the observation times are often irregular and may inform the outcome, potentially introducing bias. Garrett et al. (2024) discussed the importance of including the recommended intervals of time between doctor visits, which are usually reported in EHR. Extending this idea, they introduced a joint modelling approach for the longitudinal outcome and these recommended visit intervals, under the assumption that the recommended visit process accounts for all of the informativeness in the observations. In this paper, we replicate their simulation study assessing bias in linear mixed models under informative observation. We then extend their simulation by considering situations where the recommended visit process does not fully contain the informativeness. We report the bias and 95% confidence interval coverage of covariate effects related and unrelated to random effects after fitting a joint linear mixed model for the longitudinal outcome and recommended visit process. We conclude by discussing when joint modelling is preferable to the less complicated linear mixed effects model.


Identifying Conditions Favoring Multiplicative Heterogeneity Models in Network Meta-Analysis

Author: Xu, X.
Supervisor: Béliveau, A.

Explicitly modeling between-study heterogeneity is essential in network meta-analysis (NMA) to avoid overstating precision and to provide valid inference. The conventional random-effects model for aggregate data NMA assumes a constant additive between-study variance component. An alternative model assuming multiplicative heterogeneity inflates within-study variances by a common factor within a weighted least squares approach. Although multiplicative models have been studied in the context of pairwise meta-analysis, their performance in NMA remains under-explored. We conduct an empirical comparison of additive random-effects (RE) and multiplicative-effect (ME) models across 31 NMAs of two-arm studies summarizing odds ratios, risk ratios, and mean differences extracted from the nmadb database. Model fit is compared using ∆AIC, the difference in the Akaike Information Criterion between the additive and multiplicative models, with |∆AIC| ≤ 3 indicating similar support. We find that in 21 of 31 networks (67.7%), neither model is clearly preferred; however, the multiplicative model is favored (∆AIC < −3) in 7 networks and the additive model (∆AIC > 3) in 3 networks. Detailed case studies illustrate how the two approaches differ in their pooled treatment effect estimates and confidence intervals. RE models are sensitive to extreme and imprecise estimates, while ME models are sensitive to extreme and precise estimates. Our results suggest that multiplicative heterogeneity models warrant consideration alongside conventional random-effects models in NMA practice.
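
In the two-arm setting compared here, the two heterogeneity specifications can be summarized (standard notation for a study-level effect estimate y_i with within-study variance v_i) as

    \text{additive (RE):} \quad y_i \sim N(\theta_i, v_i), \; \theta_i \sim N(\theta, \tau^2) \;\Longleftrightarrow\; y_i \sim N(\theta, v_i + \tau^2), \qquad \text{multiplicative (ME):} \quad y_i \sim N(\theta, \phi\, v_i),

so the additive model adds a constant between-study variance \tau^2 while the multiplicative model inflates each within-study variance by a common factor \phi.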


Cost-Performance Tradeoff Routing for LLMs with Selective Abstention

Author: Chen, T.
Supervisor: Maity, S.

We study the problem of predictive routing among multiple large language models (LLMs), where the goal is to select the model that best balances generation quality and cost for a given input. Building on prior work, we extend the routing framework by introducing an abstention option, allowing the router to defer a decision when the predicted utility is low. We also incorporate query-specific true cost in place of fixed per-model costs, enabling more fine-grained trade-offs between performance and efficiency. Our theoretical analysis shows that, under mild assumptions, incorporating abstention improves performance and reduces cost both at the population level and for routers with estimated performance and cost. Empirically, we evaluate the proposed approach on RouterBench and demonstrate that abstention yields the largest gains in low-cost regimes, and that query-specific cost improves robustness compared to constant-cost baselines. These findings suggest that abstention-enabled routing with query-specific true costs can be a practical and principled approach for efficient LLM deployment.
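
A minimal sketch of the decision rule (illustrative only; the utilities, costs, trade-off weight lam, and abstention threshold tau are assumptions, and the estimated quantities from the paper would replace the toy inputs):

    # Illustrative routing rule with abstention: pick the model maximizing
    # predicted quality minus a cost penalty, and defer when the best
    # predicted utility falls below a threshold.
    import numpy as np

    def route(pred_quality, pred_cost, lam=1.0, tau=0.2):
        """pred_quality, pred_cost: per-model predictions for one query.
        Returns the index of the chosen model, or None to abstain."""
        utility = np.asarray(pred_quality) - lam * np.asarray(pred_cost)
        best = int(np.argmax(utility))
        return None if utility[best] < tau else best

    # toy example: three candidate LLMs for a single query
    print(route(pred_quality=[0.82, 0.75, 0.60], pred_cost=[0.50, 0.30, 0.05]))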


Controlling the False Discovery Rate for the Cox Regression using Knockoff Methods

Author: Zhu, S.
Supervisor: Diao, L.

Variable selection in survival analysis is critical for improving model interpretability and avoiding overfitting. This essay first reviews the field of survival analysis and variable selection, then presents a novel approach, CoxDeepKnockoff, which integrates deep generative knockoff methods with the Cox proportional hazards model to enable reproducible variable selection while controlling the false discovery rate (FDR). We provide theoretical guarantees for FDR control under mild assumptions and demonstrate the empirical performance of the proposed method through simulation studies. Results show that CoxDeepKnockoff achieves strict FDR control and high statistical power, outperforming or matching existing methods such as LASSO and model-X knockoffs. This study offers a promising framework for reliable variable selection in Cox regression, with potential applications in genomics and clinical research.
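
For reference, once a feature statistic W_j is computed for each covariate (larger values favouring the original variable over its knockoff), the knockoff+ selection rule that controls the FDR at level q is (standard form)

    \tau = \min\left\{ t > 0 : \frac{1 + \#\{ j : W_j \le -t \}}{\max\{1, \#\{ j : W_j \ge t \}\}} \le q \right\}, \qquad \hat{S} = \{ j : W_j \ge \tau \},

with CoxDeepKnockoff supplying the knockoff copies and the Cox model supplying the feature statistics.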


Nonparametric Distribution Estimation for Survival Data Using Double Generators

Author: Lin, R.
Supervisor: Diao, L.

Estimating conditional survival functions — the probability of survival beyond a certain time point given covariates — is a central problem in survival analysis. Traditional nonparametric techniques, such as kernel-based methods or tree-based ensembles, often falter when the relationship between covariates and survival times is nonlinear or when confronted with high-dimensional covariates. These limitations motivate the development of more flexible and scalable approaches. In this work, we propose a nonparametric regression method for survival analysis that leverages the power of deep generative models. Specifically, the proposed method employs a single neural network with two task-specific output branches that simultaneously parameterize the conditional generators for survival and censoring times, thereby modeling their joint distribution given covariate information. This double-generator setup enables the model to simulate synthetic data that respects the complex dependencies between survival outcomes, censoring mechanisms, and covariates, without relying on strong parametric assumptions. A key contribution of this work is its theoretical foundation. We demonstrate that under a set of mild identifiability conditions — weaker than the commonly assumed conditionally independent censoring assumption — the proposed model is statistically valid. Furthermore, we derive the minimax optimal rate for the proposed estimator in Sobolev spaces. We evaluate the empirical performance of the proposed method through extensive simulation studies, comparing it against several state-of-the-art benchmarks in survival analysis. The results demonstrate the accuracy and robustness of the proposed approach in estimating the conditional survival function.
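
Schematically, once a conditional generator G for survival times is trained, the conditional survival function can be approximated by Monte Carlo (a generic illustration, not the paper's exact estimator):

    \hat{S}(t \mid x) = \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\{ G(x, \epsilon_m) > t \}, \qquad \epsilon_m \overset{iid}{\sim} \text{reference noise}.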


Bootstrap methods for meta-analyses of diagnostic test accuracy studies

Author: Fordjour, I.
Supervisor: Negeri, Z.

Meta-analysis of diagnostic test accuracy (DTA) studies commonly employs parametric models such as the bivariate normal-normal (BNN) and bivariate binomial-normal (BBN) models to synthesize sensitivity (Se) and specificity (Sp). However, these models rely on asymptotic theory and distributional assumptions that may be violated with small sample sizes, leading to biased estimates, convergence issues, and unreliable inference. In this paper, we address a critical gap by proposing bootstrap methods, specifically nonparametric bootstrap (NPB) and parametric bootstrap (PB) models, which do not require distributional assumptions for the latent random effects and estimate the parameters using weighted least squares. The bootstrap methods achieved 100% convergence rates with robust confidence intervals across all sample sizes. We conducted comprehensive simulations to compare the performance of the proposed models with the standard models using several performance measures. The simulation results showed that for small numbers of studies, our proposed NPB and PB models outperformed BNN and showed comparable performance to BBN in bias estimation. BNN consistently overestimated Se and Sp across all scenarios. NPB demonstrated superior RMSE performance, particularly with small study numbers, and provided the most accurate estimates of between-study variances and covariance. Real-data applications confirmed these findings, with NPB producing the most reliable estimates of Se and Sp, and the narrowest confidence intervals, especially when distributional assumptions were violated. While all models perform comparably for large datasets, the proposed NPB model demonstrates clear superiority for small DTA meta-analyses, non-symmetric distributions, or when convergence is problematic. We recommend using BBN for large DTA datasets with known distributions, but strongly advocate for NPB when dealing with small samples or violated distributional assumptions, as it ensures 100% convergence and provides valid, reliable inference essential for clinical decision-making.
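
For reference, the standard BNN formulation that the proposed bootstrap models are compared against can be written (generic notation) as

    \mathrm{logit}(\widehat{Se}_i) \sim N(\theta_i, s_{1i}^2), \quad \mathrm{logit}(\widehat{Sp}_i) \sim N(\eta_i, s_{2i}^2), \quad (\theta_i, \eta_i)^{\top} \sim N(\mu, \Sigma),

with approximate within-study variances s_{1i}^2 and s_{2i}^2; the proposed NPB and PB models avoid the normality assumption on the latent study-specific effects.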


Binomial Mixtures: Theory and Numerical Estimation of the NPMLE

Author: Lopez Cortes, X.
Supervisor: Marriott, P.

Mixture models have been widely applied across various disciplines, including medicine, engineering, and astronomy. This research paper is divided into four main sections. The first section introduces the concept of mixture models and reviews recent developments in the field. The second section explores Lindsay's theoretical framework for mixture models, highlighting the role of geometry in understanding identifiability and the nonparametric maximum likelihood estimator (NPMLE). The third section presents numerical experiments for computing the NPMLE using three algorithms (MOSEK, EM, and REBayes), and compares their performance in estimating the NPMLE for a binomial mixture. The final section presents some ideas for future research regarding mixture models.
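
For reference, the binomial mixture studied here has marginal probability mass function (standard notation)

    P(X = x) = \int_0^1 \binom{n}{x} p^x (1-p)^{n-x} \, dQ(p), \qquad x = 0, \ldots, n,

and the NPMLE maximizes the likelihood over all mixing distributions Q on [0, 1]; Lindsay's geometric theory characterizes the maximizer as a discrete distribution with finitely many support points.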


Replication and Extension of Tonuity

Author: Zhen, Y.
Supervisor: Li, B.

According to Chen et al. (2019), a tonuity is a combination of a tontine and an annuity. With the introduction of Solvency II, European life insurance companies now face stricter rules on capital provision, especially those selling retirement products such as annuities, which carry longevity risk. To deal with this situation, companies may increase the price of their products, which may lead policyholders to seek alternative products. In the extreme case, a policyholder may join a tontine, where the insurance company undertakes only an administrative role and the longevity risk is shared among the policyholders in the pool. Although the tontine has a lower premium, it does not provide the steady cash flow of an annuity, which concerns those who want to avoid uncertainty. In this paper, we first replicate the results of Chen et al. (2019) on the annuity, the tontine, and the tonuity. We then extend the analysis by substituting alternative assumptions and obtain some new conclusions.


2024

Integrating uniCATE and CATE: A Two-Step Approach for Predictive Biomarker Discovery

Author: Rangipour, Z. 
Supervisor: Dubin, J.

Through a comprehensive simulation study, this paper screens varying dimensions of biomarker data, subsequently evaluating the association of the screened biomarkers with a health response of interest. The study extends the work by Boileau et al. (A flexible approach for predictive biomarker discovery) by integrating their proposed two-step method, involving their novel biomarker discovery method, uniCATE, into our analysis framework. This involved filtering predictive biomarkers using a threshold criterion and then applying CATE to the remaining features. These extensions were strategically implemented with the primary aim of advancing biomarker discovery research within the domain of biostatistics. Across our study, we observed that the methods we used for biomarker identification performed more effectively in scenarios characterized by less complex Data Generating Processes (DGPs), such as the Opposite Symmetric Linear DGP. Simplified relationships between variables within these DGPs produced more consistent and reliable identification of predictive biomarkers. In scenarios where the number of biomarkers (p) was 500 and the sample size (n) was either 100 or 500, our study achieved more accurate identification of the most predictive biomarkers. In particular, when n was 500, the models consistently selected the correct biomarkers more frequently than when n was 100. The study highlights the significance of methodological approaches in identifying biomarkers and demonstrates the effectiveness of incorporating uniCATE and Conditional Average Treatment Effect (CATE) methods across different simulated scenarios.


A spatio-temporal analysis of avian influenza H5N1 outbreaks, focusing on the impact of season and climate

Author: Bandara, P.
Supervisor: Dean, C.

The spread of H5N1 avian influenza virus in Canada poses significant challenges for public health and ecological stability. This study assesses the spatial and temporal dynamics of H5N1 outbreaks among wild birds and mammals in Canada between 2021 and 2023, with a focus on statistical modelling that helps in understanding the impact of climate and season on outbreaks. Employing Poisson, negative binomial, logistic, zero-truncated, zero-inflated Poisson, and zero-inflated negative binomial models, we identify several count models that best fit the data. The model selection process was guided by statistical criteria such as the Akaike Information Criterion (AIC), likelihood ratios, and assessments of overdispersion. An application of the space–time permutation scan statistic, which relies solely on case data without requiring population-at-risk figures, facilitated the identification of high-risk areas. These areas were mapped using ArcGIS for enhanced geographical visualization. This analysis concluded that the zero-inflated negative binomial model provided a fair fit for the H5N1 case data, highlighting significant overdispersion and a higher prevalence of zero counts than expected under a Poisson distribution. Seasonality was identified as a key influence, with varying incidence rates across different seasons. Correlations were observed between H5N1 case counts and human population density, as well as environmental variables such as temperature and precipitation. The study also pinpointed specific geographical and temporal clusters where the risks of H5N1 outbreaks were statistically higher. This study offers valuable statistical insights into the dynamics of H5N1 spread in Canada. The findings highlight relevant disease patterns, aiding in the formulation of targeted and effective disease control strategies to mitigate the impact on both human health and wildlife.


Computational Tools for the Simulation and Analysis of Spike Trains

Author: Afable, JV.
Supervisor: Marriott, P.

This paper presents a set of tools and a workflow for replicating and modifying a spiking neural network simulation of the olfactory bulb using NEURON. Key concepts in computational neuroscience are first reviewed, including spike trains, neuron models, network architectures, and the biological circuitry of the olfactory bulb. The process of replicating an existing olfactory bulb simulation study is then described in detail. Modifications to the model are explored, investigating the effects of changing the random seed and adjusting mitral-granule cell network connectivity. Results demonstrate consistent network behavior across seeds, but a strong dependence of mitral and granule cell spiking activity on connectivity between these populations. The computational workflow establishes a framework for replicating and extending published neural simulations.


Developments in Neural Simulators in Computational Neuroscience

Author: Ladtchenko, V.
Supervisor: Marriott, P.

In this paper we will look at how in silico studies have allowed scientists to minimize invasive procedures such as those required in neuroscience research. We will discuss simulation as an alternative and look into the inner workings of a simplified version of a modern simulator. Then we will discuss the mathematical modeling commonly used in simulators. We will look at deterministic and stochastic models, which are two ways of modeling brain neural networks. Next, we will look at simulators in detail and discuss their advantages and disadvantages. We will in particular focus on NEURON because it is the most popular simulator, and then on BrainPy, which is a recent development. Finally, we will perform experiments on the NEURON and BrainPy simulators and find that (1) we can run the same model in BrainPy for 10,000 neurons in under 1 second, whereas it takes 1 minute for 1,000 neurons in NEURON, signifying a 600-times speed-up; and (2) we can run a simulation with up to 50,000,000 neurons using BrainPy, which we cannot do in NEURON.


Unveiling pitfalls and exploring alternatives in the use of pilot studies for sample size estimation

Author: Ji, C.
Supervisor: Zhu, Y.

Pilot studies are used to estimate effect sizes, which in turn are used in power calculations to determine the sample size needed for the main study to achieve a prespecified power and significance level. In this paper we explore the pitfalls of using small pilot studies to perform these estimates. Additionally, we examine three alternatives for determining a sufficient sample size for the main study: the corridor of stability, which utilizes bootstrapping to determine a sample size at which the estimate of the effect size becomes stable, and two Bayesian metrics, the average coverage criterion and the average length criterion, which involve controlling statistics based on the posterior distribution of the effect size. All three of these metrics are more robust than current methods for determining sample sizes and effect sizes from small pilot studies. Both Bayesian metrics are unaffected by sample size, and hence may be able to bypass the need for pilot studies altogether.


Implementable Portfolios to Mitigate Estimation Risk

Author: Kokic, S.
Supervisor: Weng, C.

The quest towards determining the best possible investment strategy in financial settings has been on-going for centuries. Introduced by Harry Markowitz in 1952, Modern Portfolio Theory (MPT) revolutionized the field of portfolio optimization, bringing a statistical intuition to performance evaluation. The so-called “Markowitz rule” is not without its flaws: sub-optimal returns in comparison to that of the true optimal portfolio, and requirements of a large number of assets with extensive history leave much to be desired. As a result, many different portfolio optimization frameworks have been proposed in the decades since Markowitz’s groundbreaking paper. This research paper investigates these alternative frameworks in two different investment universes: one with a risk-free asset considered in investment, and the other without. By leveraging historical return data and various estimation techniques, this paper examines the performance of different portfolio optimization rules under both risk-free and no risk-free asset scenarios. Empirical results using real-world data highlight the superiority of dynamic portfolio rules, such as the implementable three-fund rule and the Kan & Zhou [2022] rule, over traditional methods like the naive and plug-in rules. These findings underscore the importance of considering estimation risk, and adopting more sophisticated strategies to achieve higher returns while managing risk adequately.
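
For context, the classical rule that the alternatives are benchmarked against allocates (in the universe with a risk-free asset, standard notation)

    w^{*} = \frac{1}{\gamma}\, \Sigma^{-1} \mu,

where \mu and \Sigma are the mean vector and covariance matrix of excess returns and \gamma is the investor's risk aversion; the plug-in rule replaces \mu and \Sigma with sample estimates, which is precisely where estimation risk enters.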


Two-phase designs for bivariate failure time models using copulas

Author: Yuan, L.
Supervisor: Cook, R.

In health studies, two-phase designs are used to cost-effectively study the relationship between an expensive-to-measure biomarker and a response variable. Failure time data, also known as survival data, also arise in many health and epidemiological studies and may involve multiple dependent failure times for each individual. When studying the relationship between a biomarker and bivariate failure times, the dependence between the failure times also needs to be considered. This essay considers issues in the development of two-phase studies for examining the relationship between a biomarker and bivariate failure times when copula models are used to accommodate the dependence between the failure times.
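
As a brief illustration of the modelling device (generic notation, not the paper's specific choice), a copula links the marginal survivor functions of the two failure times through

    S(t_1, t_2) = C_{\theta}\big( S_1(t_1), S_2(t_2) \big), \qquad \text{e.g.} \quad C_{\theta}(u, v) = \left( u^{-\theta} + v^{-\theta} - 1 \right)^{-1/\theta}, \; \theta > 0,

where the association parameter \theta captures the dependence that the two-phase design must accommodate.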


Polya Trees for Right-Censored Data

Author: Zhao, Y.
Supervisor: Diao, L.

We estimate the survivor function based on right-censored survival data. Conventionally, the survival function can be estimated parametrically by assuming that the survival time follows a parametric distribution, e.g., an exponential or Weibull distribution, and estimating the parameters indexing the distribution through the maximum likelihood method. Alternatively, it can be estimated nonparametrically using the Kaplan-Meier (Kaplan & Meier, 1958) estimator. Parametric methods are efficient under a correctly specified model but lead to biased and invalid results if the distribution of the survival time is misspecified. Nonparametric methods are robust and free of the risk of misspecification but subject to a loss of efficiency. The proposed Polya tree approach strikes a balance between the two. Polya trees are a convenient tool frequently adopted in the nonparametric Bayesian literature to solve different problems. Muliere & Walker (1997) constructed Polya trees for right-censored data, designed a tree structure that depends on the observed data, and introduced priors that take partition length into consideration. Neath (2003) built a tree structure that does not depend on the data and modelled the survival distribution using a mixture of Polya trees. We introduce a probability allocation method which can work with data-dependent or data-independent partitions. We conduct intensive simulation studies to assess the performance of the proposed method. The proposed Polya trees improve on, or perform as well as, the Polya trees proposed by Muliere & Walker (1997).


Measurement System Comparison using the Probability of Agreement Method with Assumption Violations

Author: Chan, B.
Supervisors: Stevens, N., Steiner, S.

In clinical and industry settings, we are often interested in assessing whether a new measurement system may be used interchangeably with an existing one. The probability of agreement method quantifies the agreement between two measurement systems, relying on assumptions of normality and homoscedasticity. However, these assumptions are often violated in practice. In this paper, we discuss the heteroscedastic probability of agreement method proposed by Stevens et al. (2018), and explore the probability of agreement method adapted for log-transformed data as an alternate approach to addressing assumption violations. We compare and contrast the two approaches theoretically and empirically through a case study of skewed heteroscedastic data.
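
For reference, the probability of agreement at a true measurand value s is commonly defined (in the notation of the probability of agreement literature; details may differ from the paper) as

    \theta(s) = P\big( \left| Y_1 - Y_2 \right| \le \delta \mid S = s \big),

where Y_1 and Y_2 are measurements of the same item by the two systems and \delta is a practically acceptable difference; the assumptions at issue concern the distributions of Y_1 and Y_2.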


Investigating the Performance of Direct and Indirect Causal Effect Estimators under Partial Interference and Structured Nearest Neighbour Interference

Author: Malnowski, V.
Supervisor: McGee, G.

In the framework of causal inference, interference occurs when one subject's treatment has a causal effect on another subject's potential outcomes. This indirect causal effect has shifted from being viewed as a nuisance in the past to being the primary causal effect of interest in many contexts. This paper outlines the methods proposed by Tchetgen Tchetgen and VanderWeele (2012) to estimate the population average direct, indirect, total, and overall causal effects and to quantify their uncertainty in data exhibiting stratified partial interference, using Hajek-style IPW point estimators and sandwich-form variance estimators. We then conduct a simulation study demonstrating that these estimators consistently and efficiently estimate causal indirect effects not only in stratified partial interference settings, but also in data generated under structured nearest neighbour interference. We then apply the outlined methods and simulation study results to an agronomy dataset, where we answer a relevant question from the literature regarding whether one crop's emergence date has a causal effect on another crop's grain yield by simultaneously testing for stratified partial interference and structured nearest neighbour interference.
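
For reference, under a treatment allocation strategy \alpha the population average effects estimated here are typically defined (one common convention; signs and ordering vary across the literature) as

    \overline{DE}(\alpha) = \overline{Y}(1; \alpha) - \overline{Y}(0; \alpha), \quad \overline{IE}(\alpha, \alpha') = \overline{Y}(0; \alpha) - \overline{Y}(0; \alpha'), \quad \overline{TE}(\alpha, \alpha') = \overline{Y}(1; \alpha) - \overline{Y}(0; \alpha'), \quad \overline{OE}(\alpha, \alpha') = \overline{Y}(\alpha) - \overline{Y}(\alpha'),

where \overline{Y}(a; \alpha) is the mean potential outcome under individual treatment a when the rest of the group is treated according to strategy \alpha.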


Generative Methods for Causal Inference

Author: Zheng, S.
Supervisor: Diao, L.

Estimating the causal effect due to an intervention is important in many fields including education, marketing, health care, political science and online advertising. Causal inference is a field of study that focuses on understanding the cause-and-effect relationships between variables. Causal inference can be derived from both randomized controlled trials and observational studies. While randomized controlled trials are considered the gold standard for establishing causality, observational studies often require more sophisticated statistical methods to account for potential biases and confounding factors. A core concept in analyzing observational data is the notion of counterfactuals - what would have happened to the same subject if they had been exposed to a different condition. In this essay, our discussion is likewise conducted under the counterfactual framework. We study the advancements in integrating causal inference with deep learning, focusing on two prominent models: the Causal Effect Variational Autoencoder (CEVAE) and the Generative Adversarial Network for Inference of Treatment Effects (GANITE). We compare these two models through analyzing two real datasets and find that GANITE consistently outperforms CEVAE in terms of performance metrics. Both CEVAE and GANITE exhibit areas for improvement. Future research should aim to combine the strengths of both models to develop more precise and robust approaches for causal inference, addressing the identified challenges and enhancing the accuracy of the methods.


Robustness and Efficiency Considerations when Testing Process Reliability with a Limit of Detection

Author: Bumbulis, L.
Supervisor: Cook, R.

Processes in biotechnology are considered reliable if they produce samples satisfying regulatory benchmarks. For example, laboratories may be required to show that levels of an undesirable analyte rarely (e.g. in less than 5% of samples) exceed a tolerance threshold. This can be challenging when measurement systems feature a lower limit of detection rendering some observations left-censored. In this paper we discuss the implications of detection limits for location-scale model-based inference in reliability studies, including their impact on large and finite sample properties of various estimators; power of tests for reliability and goodness of fit; and sensitivity of results to model misspecification. To improve robustness we then examine other approaches, including restricting attention to values above the limit of detection and using methods based on left-truncation, exact binomial tests, and a weakly parametric method where the right tail of the response distribution is approximated using a piecewise constant hazard model. This is followed by simulations to inform sample size selection in future reliability studies and an application to a study of residual white blood cell levels in transfusable blood products. We conclude with a brief discussion of our findings and some areas for future work.


Distribution of L1 distance in the unit hypercube

Author: Gajendragadkar, R.
Supervisor: Drekic, S.

We derive the exact distribution of the L_1 distance between two points sampled uniformly at random from an n-dimensional unit cube, and propose a hypothesis test based on this distribution for detecting dependence between the columns of a random matrix. Finally, we generalize the distribution of each coordinate of the sampled points to a Beta(a, b) distribution and conjecture an asymptotic result. Several comparative plots are provided to demonstrate the obtained results.
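
Concretely, the quantity studied is (standard notation)

    D_n = \sum_{i=1}^{n} |U_i - V_i|, \qquad U_i, V_i \overset{iid}{\sim} \mathrm{Uniform}(0, 1),

where each term |U_i - V_i| has the triangular density f(d) = 2(1 - d) on [0, 1], with mean 1/3 and variance 1/18, so D_n is an n-fold convolution of this distribution; the Beta(a, b) extension replaces the uniform coordinate law.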


A Comparison between Joint Modeling and Landmark Modeling for Dynamic Prediction

Author: An, S.
Supervisor: Dubin, J.

This research essay presents a comprehensive comparison between Joint Modeling (JM) and Landmark Modeling (LM) approaches for dynamic prediction in longitudinal data analysis. The study utilizes simulation studies and real-world data applications to evaluate the predictive performance of both models. The JM approach integrates a linear mixed-effects model for longitudinal biomarker measurements with a Cox proportional hazards model for survival data, providing a robust framework for dynamic predictions. In contrast, the LM approach updates prediction models at key time points using the latest longitudinal data, offering flexibility in handling time-varying covariates. Simulation results indicate that JM generally outperforms LM in predictive accuracy, particularly under conditions of high residual variance and long prediction horizons. However, LM demonstrates strengths in handling irregular measurement times and integrating short-term event information. The application to the Prothros dataset, involving patients with liver cirrhosis, illustrates the practical implications of both models. It highlights JM’s superior performance in the early years and LM’s variability in later years. This study underscores the importance of selecting appropriate models based on specific data characteristics and predictive goals. It suggests avenues for future research in non-linear trajectories and multi-biomarker integration to further enhance dynamic prediction methodologies.

 


Diagnostic test accuracy meta-analysis based on exact within-study variance estimation method

Author: Dabi, O.
Supervisor: Negeri, Z. 

A meta-analysis of diagnostic test accuracy (DTA) studies commonly synthesizes study-specific test sensitivity (Se) and test specificity (Sp) from different studies that aim to quantify the screening or diagnostic performance of a common index test of interest. A bivariate random effects model that utilizes the logit transformation of Se and Sp and accounts for the within-study and between-study heterogeneity is commonly used to make statistical inferences about the unknown test characteristics. However, it is well reported that this model may lead to misleading inference since it employs the logit transformation and approximate within-study variance estimate. Alternative transformations which do not require continuity corrections such as the arcsine square root and Freeman-Tukey double arcsine were recently proposed to overcome the former limitation. However, these solutions also suffer from using approximate within-study variance estimates, which can only be justified when within-study sample sizes are large. To overcome these problems, we propose an exact within-study variance estimation approach which does not require a continuity correction and is invariant to transformations. We evaluate the proposed method compared to the existing approach using real-life and simulated meta-analyses of DTA data. Our findings indicate that both methods perform comparably when there are no zero cell counts in the DTA data and the sample sizes (the numbers of diseased and non-diseased individuals) per study are large. The approximate method significantly underestimates the summary Se and Sp, especially when the true Se and Sp pairs are closer to 1. However, the analytical method has better bias, root mean squared error (RMSE), confidence interval (CI) width, and coverage probability for Se and Sp when the true Se and Sp are large. Similar results are found when comparing the methods in terms of the between-study variance-covariance parameters. Therefore, researchers and practitioners can use either of the within-study variance estimation methods for aggregate data meta-analysis (ADMA) of DTA studies with no zero cell counts and large within-study sample sizes. Conversely, the analytical method should be preferred over the approximate technique for ADMA of DTA studies with zero cell counts or small within-study sample sizes.


Imputation Approach to Missing Data and Causal Inference

Author: Huang, Z.
Supervisor: Wu, C.

We provide a critical review of the imputation approach to missing data analysis and causal inference. We present general settings and methodologies for each topic, discuss key assumptions for the validity of the methods, and highlight the connections and common features of these two seemingly distinct areas under a unified framework for imputation-based methods. Our simulation studies substantiate the practicality of applying imputation techniques originally developed for missing data to estimate average treatment effects in causal inference, demonstrating their effectiveness and versatility.


A Review and Comparison of Multiple Testing Procedures

Author: Wu, R. P.
Supervisor: Stevens, N.

This paper provides a broad comparison and review of various multiple testing procedures ranging from classical methods such as the Bonferroni correction, Holm’s procedure, Hommel’s procedure, and Hochberg’s procedure, to more recent methods like the PAAS procedure, the Fallback, and the 4A procedure. We evaluate the performance of these procedures through simulation studies, considering various levels of marginal power, correlation, and the number of well-powered and under-powered endpoints. Our simulation results reveal that although the Bonferroni correction is overly conservative, the practical difference in its empirical power compared to the other methods is small in many settings. However, more advanced procedures, such as the Fallback and 4A procedure, achieve higher empirical power but at the loss of simplicity and interpretability. In contrast, the General Multistage Gatekeeping (GMG) procedure, which groups hypotheses into families based on criteria such as endpoint importance, demonstrates lower empirical power compared to other methods in the context of multiple endpoints. The results and insights gleaned from this paper underscore the importance of choosing an appropriate multiple testing procedure based on the specific use case. Our findings suggest that while more advanced methods can ensure control over the family-wise error rate, they introduce an added layer of complexity both in application and interpretation. This paper aims to serve as a guideline and ‘play-book’ for researchers and industry professionals in selecting the right multiple testing procedure for their respective circumstances.
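
To illustrate the classical end of this spectrum (a minimal sketch, not the paper's simulation code; the p-values are made up for the example), the Bonferroni and Holm procedures can be written in a few lines, with Holm showing the extra rejections available at no cost in family-wise error control:

    # Illustrative sketch: Bonferroni and Holm adjustments for a vector of
    # p-values, both controlling the family-wise error rate at level alpha.
    import numpy as np

    def bonferroni(p, alpha=0.05):
        p = np.asarray(p, dtype=float)
        return p <= alpha / len(p)                 # rejection flags

    def holm(p, alpha=0.05):
        p = np.asarray(p, dtype=float)
        m = len(p)
        reject = np.zeros(m, dtype=bool)
        for rank, idx in enumerate(np.argsort(p)): # step down from smallest p-value
            if p[idx] <= alpha / (m - rank):
                reject[idx] = True
            else:
                break                              # stop at the first non-rejection
        return reject

    pvals = [0.001, 0.012, 0.020, 0.300]
    print(bonferroni(pvals))  # rejects the two smallest p-values
    print(holm(pvals))        # additionally rejects the third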


dWOLS precision medicine implementation with measurement error from treatment non-adherence

Author: Mawer, K.
Supervisor: Wallace, M.

Precision medicine tailors treatment to patient characteristics in order to optimize the response, and a dynamic treatment regime (DTR) formalizes this idea as a treatment rule or a sequence of treatment rules. The goal is to find the optimal DTR, the one that maximizes the expected outcome. We use dynamic weighted ordinary least squares (dWOLS), a DTR estimation method that weights atypical treatment decisions more heavily. Ideally, patients adhere fully to the treatment they are prescribed; when they do not, the treatment variate is measured with error, since a person's actual treatment differs from the prescribed one. Because non-adherence may stem from unwillingness to tolerate side effects or from forgetfulness, we define corresponding personality variables relating to openness and conscientiousness, respectively. For the treatment variate, we assume an experimental and a control treatment, and non-adherent patients are treated as though they received the control. Although measurement error can affect DTR estimation, dWOLS can still be applied when treatments are measured with error. We simulate the effects of non-adherence to understand the resulting bias and other potential problems, including the case where non-adherence depends on the variates used to tailor the treatment. We show that the bias of the estimators can be corrected under non-adherence, and that the correction is simpler when non-adherence is independent of the tailoring variate.
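The following minimal single-stage dWOLS sketch (written in Python with hypothetical parameter values, and without the non-adherence/measurement-error component studied in the paper) shows the basic recipe: fit a treatment model, form balancing weights |A - P(A = 1 | X)|, which is one standard dWOLS weight choice, and run a weighted least squares regression whose treatment-by-covariate terms estimate the decision rule:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)
    n = 5000
    x = rng.normal(size=n)                               # tailoring variate
    a = rng.binomial(1, 1 / (1 + np.exp(-0.6 * x)))      # prescribed treatment
    # outcome with a treatment-by-covariate interaction (blip): 0.5 + 1.0 * x
    y = x + a * (0.5 + 1.0 * x) + rng.normal(size=n)

    # treatment model for the weights |A - P(A=1|X)|
    pi = LogisticRegression().fit(x.reshape(-1, 1), a).predict_proba(x.reshape(-1, 1))[:, 1]
    w = np.abs(a - pi)

    # weighted least squares of Y on (1, X, A, A*X) via the sqrt-weight trick
    X = np.column_stack([np.ones(n), x, a, a * x])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    print("blip estimates (psi0, psi1):", beta[2:], " truth: [0.5, 1.0]")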


Measurement Error in the Tailoring Covariate and Its Association with Group Membership

Author: Sivathayalan, J.
Supervisor: Wallace, M.

Dynamic treatment regimes provide a framework for delivering personalized interventions for a given condition, but their construction relies on error-prone measurements of covariates and treatments. Such measurement error may be associated with individuals' membership in certain groups, such as sociodemographic categories, and can therefore affect the adequacy of the treatment(s) they are given. Dynamic weighted ordinary least squares has been established as a doubly robust method of estimation for dynamic treatment regimes with a relatively straightforward implementation; this paper explores its use in a single-stage regime, with a sample consisting of individuals from groups measured with varying amounts of error. Simulation results show better accuracy in treatment assignment for individuals in the group measured with less error, and indicate that larger sample sizes and lower error magnitudes generally lead to improved accuracy. The results demonstrate the impact of measurement error in a setting where its effect varies with group membership, and how this can affect the quality of treatment received.


Spatio-temporal Data Analysis

Author: Ge, R.
Supervisor: Dubin, J.

The paper explores the challenges and methodologies involved in analyzing spatio-temporal data, which is increasingly generated from sources such as remote sensing, mobility data, wearable devices, and social media. Spatio-temporal data, characterized by its spatial and temporal components, requires sophisticated analytical methods due to its complexity and inherent spatial autocorrelation. Significant challenges in spatio-temporal data analysis include handling errors from missing observations, systematic biases, and measurement inaccuracies. The integration of spatial and temporal database models into unified spatio-temporal models has been a focus of recent research, aiming to improve the practical application and development of these models.

Bayesian hierarchical models are emphasized for their ability to incorporate time and area effects, providing insights through the interpretability of neighborhood structures and adjacent times. However, these models traditionally rely on Markov chain Monte Carlo methods, which are computationally intensive. This research essay presents the Integrated Nested Laplace Approximation (INLA) as a computationally efficient alternative for Bayesian analysis, especially suitable for latent Gaussian models. INLA offers significant computational advantages, providing precise estimates in a fraction of the time required by traditional methods. Additionally, the paper discusses the application of Generalized Linear Mixed Effects Models (GLMMs), which have gained popularity for modeling spatio-temporal data due to their flexibility in handling different types of data and accounting for spatial random effects. The GLMM framework is capable of capturing the correlation between observations over time and space, making it a valuable tool for spatio-temporal data analysis. In this essay, I highlight the potential of Bayesian hierarchical models and GLMMs, alongside computational advancements like INLA, to enhance the accuracy and efficiency of spatio-temporal data analysis. The study suggests avenues for methodological refinement and emphasizes the need for careful prior selection to ensure reliable estimates in practical applications. Future research should incorporate real-world data and explore more complex spatio-temporal correlations to enhance the applicability and robustness of INLA models.
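As a point of reference for the class of latent Gaussian models to which INLA applies, one common spatio-temporal specification (shown here only as an illustrative example, not necessarily the exact model reviewed in the essay) is a Poisson count model with structured spatial, unstructured spatial, temporal, and interaction random effects:

    \[
    \begin{aligned}
    y_{it} \mid \lambda_{it} &\sim \mathrm{Poisson}(E_{it}\,\lambda_{it}), \\
    \log \lambda_{it} &= \beta_0 + \mathbf{x}_{it}^{\top}\boldsymbol{\beta} + u_i + v_i + \gamma_t + \delta_{it},
    \end{aligned}
    \]

where \(u_i\) is a spatially structured (e.g., conditional autoregressive) effect, \(v_i\) an unstructured area effect, \(\gamma_t\) a temporal effect such as a first-order random walk, and \(\delta_{it}\) a space-time interaction; because all of these latent components are Gaussian, the model is amenable to INLA.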


Sequential Tennis: A Tennis Engine for Coaching, Commentary, Evaluation, and Simulation

Author: Wang, C. 
Supervisor: Drekic, S.

This research paper introduces a novel method to model and simulate tennis rallies, using the sequential nature of tennis to abstract a rally into a game tree while preserving key components such as shot trajectories, hitting windows, player movement speed, and shot risk. The resulting game complexity is estimated with respect to various metrics and compared to other sequential games such as chess and Go. After the model is constructed, a modified version of the negamax algorithm incorporating risk can be applied to obtain an engine which evaluates player decision making and recommends optimal strategies and tactics. This is demonstrated with a case study of how a rally can develop when a server employs a kick serve, showcasing potential applications in coaching, match commentary, player evaluation, and simulation for game development and match prediction.
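The game-tree evaluation at the heart of the engine can be summarized by the standard negamax recursion; the toy Python sketch below works on a nested-list tree with leaf payoffs from the server's perspective, and it omits the paper's risk modification and the tennis-specific state (trajectories, hitting windows, movement):

    def negamax(node, color=1):
        """Minimal negamax over a toy game tree.
        `node` is either a terminal payoff (float, from the server's view)
        or a list of child nodes. `color` flips sign at each ply so that
        both players maximize their own signed payoff."""
        if not isinstance(node, list):          # terminal: return signed payoff
            return color * node
        return max(-negamax(child, -color) for child in node)

    # toy rally tree: each level is one shot choice, leaves are point values
    tree = [[0.8, -0.2], [[-1.0, 0.6], 0.1]]
    print(negamax(tree))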


Analysis of Multi-Server Queuing Systems with Batch Arrivals: Applications in Insurance Claim Processing

Author: Zhang, Z.
Supervisor: Drekic, S.

This report presents a comprehensive analysis of a multi-server queuing system with batch arrivals, focusing on its application in insurance claim processing. We investigate a model where claims arrive in batches at regular intervals and are processed by adjusters with exponentially distributed service times. The study covers the general, single-server, and infinite-server cases, deriving expressions for the number of unprocessed claims per batch and the total number of unprocessed claims. A key finding reveals that in the single-server case, when batch sizes follow a Discrete Phase-type (DPH) distribution, the number of unprocessed claims per batch also follows a DPH distribution, enabling the application of matrix-analytic methods. We explore system behavior under various conditions, discuss practical implications for insurance claim processing, and address computational aspects for large-scale systems. The analysis provides insights into resource allocation, system efficiency, and performance optimization. The report concludes by identifying areas for future research, contributing to the broader understanding of batch processing in queuing theory and its real-world applications.
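To give a feel for the quantity being analyzed, the Python sketch below simulates the single-server case with batches arriving at fixed intervals and exponential service, recording the number of unprocessed claims just before each batch arrival; the Poisson batch-size distribution and all parameter values are hypothetical stand-ins rather than the DPH setting treated analytically in the report:

    import numpy as np

    rng = np.random.default_rng(3)
    T, mu, n_batches = 1.0, 8.0, 10_000   # inter-batch interval, service rate, number of batches
    queue = 0
    unprocessed = []

    for _ in range(n_batches):
        queue += rng.poisson(5)           # hypothetical batch size
        # serve claims one at a time until the next batch arrives (single server);
        # restarting an interrupted service is valid because service is exponential
        t = 0.0
        while queue > 0:
            s = rng.exponential(1 / mu)
            if t + s > T:
                break
            t += s
            queue -= 1
        unprocessed.append(queue)         # claims still unprocessed at the batch boundary

    print("mean unprocessed claims just before a batch arrival:", np.mean(unprocessed))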


GAM SymbolicGPT: A Generalized Additive Model Approach to SymbolicGPT

Author: Zhang, D.
Supervisor: Ghodsi, A.

Symbolic regression is the process of deriving a mathematical expression that best describes the underlying relationship between a set of input and output variables. While deep learning-based approaches have achieved significant success in this domain, they often struggle with high-dimensional input data due to the immense search space. In this work, to address the issue of high dimensionality, we extend an existing deep learning method, SymbolicGPT, by proposing a novel algorithm, GAM SymbolicGPT, inspired by the backfitting algorithm used for fitting generalized additive models. Through experimentation, we highlight the limitations of our method in addressing high-dimensional symbolic regression tasks.
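For context, the classic backfitting loop that inspires the proposed algorithm fits each additive component in turn against the partial residuals of the others. The Python sketch below uses a crude polynomial smoother as a stand-in; in GAM SymbolicGPT one would expect each univariate fit to come from the symbolic-regression model instead, though that substitution is our reading of the abstract rather than a detail stated in it:

    import numpy as np

    def backfit(X, y, smoother, n_iter=20):
        """Classic backfitting for y ~ alpha + sum_j f_j(X[:, j]).
        `smoother(x, r)` returns fitted values of a univariate smooth of r on x."""
        n, p = X.shape
        alpha = y.mean()
        f = np.zeros((n, p))
        for _ in range(n_iter):
            for j in range(p):
                partial = y - alpha - f.sum(axis=1) + f[:, j]   # partial residuals
                f[:, j] = smoother(X[:, j], partial)
                f[:, j] -= f[:, j].mean()                       # centering for identifiability
        return alpha, f

    def poly_smoother(x, r, deg=3):
        # a crude stand-in smoother: univariate polynomial fit
        return np.polyval(np.polyfit(x, r, deg), x)

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(500, 2))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)
    alpha, f = backfit(X, y, poly_smoother)
    print("residual std:", np.std(y - alpha - f.sum(axis=1)))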


Meta-Modeling for Fair Fee Determination in Registered Index-Linked Annuities (RILAs)

Author: Quan, H.
Supervisor: Feng, B.

This paper introduces a novel approach to analyzing Registered Index-Linked Annuities (RILAs), an emerging financial product that blends the features of Variable Annuities (VAs) with simpler characteristics. This unique combination facilitates the application of advanced meta-modeling techniques. We developed a comprehensive simulation model to evaluate RILA performance, drawing on a compressor-simulator-predictor structure. The model incorporates diverse factors such as smoking behavior, residency, and age, and we adapt the objective for enhanced industry applicability. First, we propose a regression-based method to determine the fair fees used to label RILA contracts for predictor use. Second, we use various predictors to compare and assess the model’s performance.


Differential Privacy: A Survey and Review

Author: Qin, Y. 
Supervisor: Chenouri, S.

In an era where data privacy has become increasingly crucial, differential privacy emerges as a leading framework for safeguarding individual information while enabling the analysis of large datasets. This paper presents a brief survey and review of differential privacy, exploring its foundational principles, key mechanisms, and practical applications. We examine the theoretical underpinnings of differential privacy, including the central and local models, and discuss the trade-offs between privacy guarantees and data utility. Through a brief introduction of common techniques such as the Laplace and Gaussian mechanisms, we highlight their effectiveness in various contexts and their flexibility and versatility in statistical analysis. Drawing on insights from existing literature, we also briefly discuss future directions for differential privacy research.
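As a concrete example of the simplest of these mechanisms, the Python sketch below adds Laplace noise scaled to a query's L1 sensitivity; the dataset and privacy parameter are hypothetical:

    import numpy as np

    rng = np.random.default_rng(2)

    def laplace_mechanism(true_value, sensitivity, epsilon):
        """Release the query answer with Laplace(sensitivity / epsilon) noise,
        which satisfies epsilon-differential privacy for a query with the
        given L1 sensitivity."""
        return true_value + rng.laplace(scale=sensitivity / epsilon)

    ages = np.array([23, 35, 41, 29, 52])
    # the counting query "how many people are over 30?" has L1 sensitivity 1
    print(laplace_mechanism((ages > 30).sum(), sensitivity=1.0, epsilon=0.5))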


Investigating the Performance and Parametrization of a Multiscale Spike Train Model

Author: Das, K.
Supervisor: Ramezan, R.

The brain, as the central organ of the nervous system, controls various complex processes through the activity of neurons, which communicate via sequences of consecutive action potentials called spike trains. This paper evaluates the performance of estimation algorithms and software for two multiscale models for the intensity function of an inhomogeneous Poisson process, proposed by Ramezan and colleagues in 2014 for neural spike trains. Through simulations, we focus on the recovery of known multiscale intensity functions with one or two periodic components. We also address the dimensionality issues of these models. Simulation results demonstrate that while the smoothed periodogram effectively identifies the original frequency values and initial phases, challenges arise in accurately estimating the models’ parameters. Significant trial-to-trial variability indicates that the models struggle to provide low-variance parameter estimates across trials. Studying the Fisher information matrix, we observed “practical unidentifiability” in these models, defined for cases where the log-likelihood is theoretically curved but the curvature is too small for meaningful joint inference about the parameters.
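For readers unfamiliar with the setup, the Python sketch below simulates a spike train from an inhomogeneous Poisson process whose intensity is a baseline rate plus two periodic components, using the standard thinning (Lewis-Shedler) algorithm; the particular intensity function and its parameters are hypothetical and are not the parameterization of the models studied in the paper:

    import numpy as np

    def simulate_inhomogeneous_poisson(intensity, t_max, lam_max, rng):
        """Simulate spike times on [0, t_max] by thinning a homogeneous
        Poisson process of rate lam_max (Lewis-Shedler algorithm)."""
        t, spikes = 0.0, []
        while True:
            t += rng.exponential(1 / lam_max)
            if t > t_max:
                return np.array(spikes)
            if rng.uniform() < intensity(t) / lam_max:
                spikes.append(t)

    # hypothetical multiscale intensity: baseline plus two periodic components
    intensity = lambda t: 20 + 10 * np.cos(2 * np.pi * 8 * t) + 5 * np.cos(2 * np.pi * 2 * t + 0.7)
    rng = np.random.default_rng(4)
    spikes = simulate_inhomogeneous_poisson(intensity, t_max=10.0, lam_max=35.0, rng=rng)
    print(len(spikes), "spikes simulated")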


Mimic Modelling of Expected Goals in Soccer Analysis

Author: Owusu Boateng, B.
Supervisor: Davis, M.J.

Expected goals (xG) has become a key metric in modern soccer analytics, providing a reliable way to evaluate the quality of scoring opportunities. This study investigates the factors that influence xG using shot location data from the 2014 La Liga season. By applying a mixed-effects modeling approach, we analyze how factors such as shot distance, shot type, game situation, and the influence of individual players and teams affect the probability of scoring. The dataset includes approximately 90,000 shot attempts, capturing detailed information on shot coordinates, shot type (e.g., right foot, header, penalty), game context (e.g., open play, set pieces), and player and team identifiers. To address the complexity of soccer, mixed-effects models were employed to account for both fixed effects, such as shot distance and game situation, and random effects tied to players and teams, which capture variability at both the individual and team levels. The results show that shot distance has a strong inverse relationship with scoring probability, with shots taken closer to the goal resulting in higher xG values. Certain shot types, such as penalties, significantly increase xG, while headers exhibit greater variability in their effectiveness. Game situations, such as open play versus set pieces, also reveal distinct patterns in scoring likelihood, reflecting differences in tactical approach. The inclusion of random effects highlights the importance of player- and team-specific factors, indicating that individual skill and team strategy play crucial roles in determining shot outcomes. This study demonstrates the value of mixed-effects models in providing a detailed understanding of the factors that influence scoring in soccer. By accounting for both fixed and random effects, the model offers a thorough analysis of shot effectiveness and variability across players and teams. These findings have practical applications for coaches and analysts seeking to refine training methods and optimize tactics for winning soccer matches. The research also suggests opportunities for further exploration, particularly by expanding the dataset to multiple seasons or incorporating additional factors such as player positioning, opponent pressure, and match dynamics to enhance the predictive accuracy of xG models.
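One plausible form of the model described above (shown for illustration; the exact covariate coding used in the study may differ) is a logistic mixed-effects regression with random intercepts for player and team, whose fitted probability for a shot is its xG:

    \[
    \operatorname{logit}\{\Pr(\text{goal}_{ijk} = 1)\}
      = \beta_0 + \beta_1\,\text{distance}_{ijk}
      + \boldsymbol{\beta}_2^{\top}\text{shot type}_{ijk}
      + \boldsymbol{\beta}_3^{\top}\text{situation}_{ijk}
      + b_j + c_k,
    \qquad b_j \sim N(0, \sigma_b^2),\; c_k \sim N(0, \sigma_c^2),
    \]

where \(i\) indexes shots, \(j\) players, and \(k\) teams; the variances \(\sigma_b^2\) and \(\sigma_c^2\) quantify the player-level and team-level variability highlighted in the results.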


Examining Violations of the Poisson Process Assumption in Ice Hockey Goals

Author: Uchendu, C.
Supervisor: Davis, M.J.

The Poisson process is a foundational framework for modeling event timings, but its assumptions of a constant rate and independence are often violated in the dynamic context of ice hockey goal scoring. This study critically examines these violations using play-by-play data from the 2022–2023 NHL season. By analyzing inter-goal times (the times between goals) across different game periods and the entire game, the study evaluates alternative statistical models, including the exponential, mixture of exponentials, gamma, generalized gamma, and beta distributions, to address the limitations of the Poisson process. These models were fitted using maximum likelihood estimation (MLE) and evaluated with model selection criteria (AIC and BIC). Censoring adjustments were implemented to account for truncated data due to fixed game durations. Results indicate that the Poisson process fails to adequately describe goal intervals, with significant deviations observed across periods. Among the tested models, the (censored) generalized gamma distribution consistently outperformed the others, capturing the heavy-tailed and bursty nature of goal-scoring intervals. The beta distribution also showed strong performance in modeling time-bounded periods. These findings highlight the need for more flexible models in hockey analytics to account for variability and dependencies in goal scoring. Practical implications include applications in coaching strategies and sports betting, where accurate modeling of scoring dynamics can provide a strategic advantage. Future research should incorporate contextual and spatial-temporal factors to further enhance model precision.
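The model-comparison step can be illustrated with a few lines of Python using scipy: the sketch below fits the exponential, gamma, and generalized gamma distributions by maximum likelihood to hypothetical inter-goal times and compares them by AIC, omitting the censoring adjustment for fixed game durations that the study applies:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)
    # hypothetical inter-goal times in minutes (replace with NHL play-by-play data)
    times = rng.gamma(shape=0.8, scale=12.0, size=500)

    def aic(dist, data):
        params = dist.fit(data, floc=0)     # fit with the location fixed at zero
        loglik = dist.logpdf(data, *params).sum()
        k = len(params) - 1                 # loc is fixed, so it is not a free parameter
        return 2 * k - 2 * loglik

    print("exponential AIC:", aic(stats.expon, times))
    print("gamma AIC:      ", aic(stats.gamma, times))
    print("gen. gamma AIC: ", aic(stats.gengamma, times))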