Master's Research Papers

2024

Integrating uniCATE and CATE: A Two-Step Approach for Predictive Biomarker Discovery

Author: Rangipour, Z. 
Supervisor: Dubin, J.

Through a comprehensive simulation study, this paper screens biomarker data of varying dimensions and then evaluates the association of the screened biomarkers with a health response of interest. The study extends the work of Boileau et al. (A flexible approach for predictive biomarker discovery) by integrating their proposed two-step method, built around their novel biomarker discovery method uniCATE, into our analysis framework: predictive biomarkers are first filtered using a threshold criterion, and the conditional average treatment effect (CATE) is then estimated on the remaining features. These extensions were implemented with the primary aim of advancing biomarker discovery research in biostatistics. Across our study, the biomarker identification methods performed more effectively under less complex Data Generating Processes (DGPs), such as the Opposite Symmetric Linear DGP; the simpler relationships between variables in these DGPs produced more consistent and reliable identification of predictive biomarkers. When the number of biomarkers (p) was 500 and the sample size (n) was either 100 or 500, the study identified the most predictive biomarkers more accurately, and with n = 500 the models selected the correct biomarkers more consistently than with n = 100. The study highlights the importance of the methodological approach used to identify biomarkers and demonstrates the effectiveness of incorporating the uniCATE and CATE methods across the simulated scenarios.
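
As a rough illustration of the general "screen, then estimate" idea (not the uniCATE procedure or the authors' exact pipeline), the sketch below scores each biomarker by the magnitude of a simple univariate treatment-interaction estimate, keeps the top-scoring features, and then fits a T-learner CATE model on the retained set. The data, threshold, and learners are placeholders.

```python
# Minimal "screen, then estimate CATE" sketch; uniCATE and the paper's filtering
# rule are not reproduced here.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n, p = 500, 200                       # sample size and number of biomarkers (placeholders)
X = rng.normal(size=(n, p))
A = rng.binomial(1, 0.5, size=n)      # randomized treatment
tau = X[:, :5] @ np.ones(5)           # only the first 5 biomarkers are truly predictive
Y = X[:, 0] + A * tau + rng.normal(size=n)

# Step 1: univariate screen on the estimated biomarker-by-treatment interaction
scores = np.empty(p)
for j in range(p):
    Z = np.column_stack([np.ones(n), A, X[:, j], A * X[:, j]])
    beta = np.linalg.lstsq(Z, Y, rcond=None)[0]
    scores[j] = abs(beta[3])
keep = np.argsort(scores)[-20:]       # retain the 20 highest-scoring biomarkers (arbitrary cutoff)

# Step 2: estimate the CATE on the retained features with a T-learner
mu1 = RandomForestRegressor(random_state=0).fit(X[A == 1][:, keep], Y[A == 1])
mu0 = RandomForestRegressor(random_state=0).fit(X[A == 0][:, keep], Y[A == 0])
cate_hat = mu1.predict(X[:, keep]) - mu0.predict(X[:, keep])
print("correlation with true CATE:", np.corrcoef(cate_hat, tau)[0, 1])
```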


A spatio-temporal analysis of avian influenza H5N1 outbreaks, focusing on the impact of season and climate

Author: Bandara, P.
Supervisor: Dean, C.

The spread of the H5N1 avian influenza virus in Canada poses significant challenges for public health and ecological stability. This study assesses the spatial and temporal dynamics of H5N1 outbreaks among wild birds and mammals in Canada between 2021 and 2023, with a focus on statistical modelling that helps in understanding the impact of climate and season on outbreaks. Employing Poisson, negative binomial, logistic, zero-truncated, zero-inflated Poisson, and zero-inflated negative binomial models, we identify the count models that best fit the data. The model selection process was guided by statistical criteria such as the Akaike Information Criterion (AIC), likelihood ratios, and assessments of overdispersion. An application of the space–time permutation scan statistic, which relies solely on case data without requiring population-at-risk figures, facilitated the identification of high-risk areas; these areas were mapped using ArcGIS for enhanced geographical visualization. The analysis concluded that the zero-inflated negative binomial model provided a fair fit for the H5N1 case data, highlighting significant overdispersion and a higher prevalence of zero counts than expected under a Poisson distribution. Seasonality was identified as a key influence, with incidence rates varying across seasons. Correlations were observed between H5N1 case counts and human population density, as well as environmental variables such as temperature and precipitation. The study also pinpointed specific geographical and temporal clusters where the risk of H5N1 outbreaks was statistically higher. This study offers valuable statistical insights into the dynamics of H5N1 spread in Canada. The findings highlight relevant disease patterns, aiding in the formulation of targeted and effective disease control strategies to mitigate the impact on both human health and wildlife.
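
For readers unfamiliar with the candidate count models, the fragment below fits a Poisson GLM and a zero-inflated negative binomial model to simulated zero-heavy counts and compares their AICs. The covariate, parameters, and data are illustrative stand-ins, not the H5N1 surveillance data analyzed in the paper.

```python
# Illustrative Poisson-versus-ZINB comparison on simulated zero-heavy count data.
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

rng = np.random.default_rng(7)
n = 1000
temp = rng.normal(size=n)                               # stand-in climate covariate
lam = np.exp(0.5 + 0.8 * temp)
counts = rng.negative_binomial(n=2, p=2 / (2 + lam))    # overdispersed counts with mean lam
counts[rng.random(n) < 0.4] = 0                         # extra structural zeros

X = sm.add_constant(temp)
pois = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
zinb = ZeroInflatedNegativeBinomialP(counts, X, exog_infl=X, p=2).fit(disp=False, maxiter=500)

print("Poisson AIC:", round(pois.aic, 1))
print("ZINB    AIC:", round(zinb.aic, 1))               # typically much lower on data like these
```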


Computational Tools for the Simulation and Analysis of Spike Trains

Author: Afable, JV.
Supervisor: Marriott, P.

This paper presents a set of tools and a workflow for replicating and modifying a spiking neural network simulation of the olfactory bulb using NEURON. Key concepts in computational neuroscience are first reviewed, including spike trains, neuron models, network architectures, and the biological circuitry of the olfactory bulb. The process of replicating an existing olfactory bulb simulation study is then described in detail. Modifications to the model are explored, investigating the effects of changing the random seed and adjusting mitral-granule cell network connectivity. Results demonstrate consistent network behavior across seeds, but a strong dependence of mitral and granule cell spiking activity on connectivity between these populations. The computational workflow establishes a framework for replicating and extending published neural simulations.


Developments in Neural Simulators in Computational Neuroscience

Author: Ladtchenko, V.
Supervisor: Marriott, P.

In this paper we look at how in silico studies have allowed scientists to minimize invasive procedures such as those required in neuroscience research. We discuss simulation as an alternative and look into the inner workings of a simplified version of a modern simulator. We then discuss the mathematical modeling commonly used in simulators, covering deterministic and stochastic models, which are two ways of modeling brain neural networks. Next, we look at simulators in detail and discuss their advantages and disadvantages, focusing in particular on NEURON, because it is the most popular simulator, and on BrainPy, which is a recent development. Finally, we perform experiments on the NEURON and BrainPy simulators and find that (1) the same model that takes 1 minute to run for 1,000 neurons in NEURON runs in under 1 second for 10,000 neurons in BrainPy, a speed-up of roughly 600 times when scaled to the same number of neurons, and (2) BrainPy can run a simulation with up to 50,000,000 neurons, which we cannot do in NEURON.


Unveiling pitfalls and exploring alternatives in the use of pilot studies for sample size estimation

Author: Ji, C.
Supervisor: Zhu, Y.

Pilot studies are used to estimate effect sizes, which in turn are used in power calculations to determine the sample size needed for the main study to achieve a prespecified power and significance level. In this paper we explore the pitfalls of using small pilot studies to produce these estimates. We then examine three alternatives for determining a sufficient sample size for the main study: the corridor of stability, which uses bootstrapping to determine a sample size at which the estimate of the effect size becomes stable, and two Bayesian metrics, the average coverage criterion and the average length criterion, which control statistics based on the posterior distribution of the effect size. All three metrics are more robust than current methods for determining sample sizes and effect sizes from small pilot studies, and because the two Bayesian metrics are unaffected by sample size, they may be able to bypass the need for pilot studies altogether.
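
To make the pitfall concrete, the snippet below shows how strongly the standard sample-size calculation depends on the pilot estimate of Cohen's d for a two-sample t-test; the effect-size values are hypothetical.

```python
# Required sample size for 80% power under several plausible pilot estimates of
# the same true effect (two-sample t-test, alpha = 0.05). Effect sizes are made up.
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for d_pilot in (0.2, 0.35, 0.5, 0.8):
    n_per_arm = solver.solve_power(effect_size=d_pilot, power=0.80, alpha=0.05)
    print(f"pilot d = {d_pilot:.2f}  ->  n per arm = {n_per_arm:.0f}")
```

A corridor-of-stability check could be sketched in the same spirit by bootstrapping the effect-size estimate at increasing sample sizes and recording where it stays within a tolerance band.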


Implementable Portfolios to Mitigate Estimation Risk

Author: Kokic, S.
Supervisor: Weng, C.

The quest to determine the best possible investment strategy in financial settings has been ongoing for centuries. Introduced by Harry Markowitz in 1952, Modern Portfolio Theory (MPT) revolutionized the field of portfolio optimization, bringing statistical intuition to performance evaluation. The so-called “Markowitz rule” is not without its flaws: sub-optimal returns relative to those of the true optimal portfolio, and the requirement of a large number of assets with extensive return history, leave much to be desired. As a result, many different portfolio optimization frameworks have been proposed in the decades since Markowitz’s groundbreaking paper. This research paper investigates these alternative frameworks in two investment universes: one in which a risk-free asset is available for investment, and one without. By leveraging historical return data and various estimation techniques, this paper examines the performance of different portfolio optimization rules under both the risk-free and no risk-free asset scenarios. Empirical results using real-world data highlight the superiority of dynamic portfolio rules, such as the implementable three-fund rule and the Kan & Zhou [2022] rule, over traditional methods like the naive and plug-in rules. These findings underscore the importance of accounting for estimation risk and of adopting more sophisticated strategies to achieve higher returns while managing risk adequately.
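
For orientation, the toy sketch below contrasts the naive 1/N rule with plug-in mean-variance (tangency-style) weights estimated from a short simulated return history; the asset universe, parameters, and normalization are placeholders rather than the specific rules evaluated in the paper.

```python
# Naive 1/N weights versus plug-in mean-variance weights estimated from simulated
# excess returns; a toy contrast illustrating estimation risk, not the paper's rules.
import numpy as np

rng = np.random.default_rng(3)
N, T = 5, 120                                   # assets and months of history (placeholders)
mu_true = rng.uniform(0.002, 0.01, size=N)      # true monthly excess means
cov_true = 0.002 * (0.5 * np.eye(N) + 0.5)      # equicorrelated covariance matrix
R = rng.multivariate_normal(mu_true, cov_true, size=T)

mu_hat, cov_hat = R.mean(axis=0), np.cov(R, rowvar=False)
w_naive = np.full(N, 1.0 / N)
w_plugin = np.linalg.solve(cov_hat, mu_hat)
w_plugin /= w_plugin.sum()                      # normalize to fully invested (ignores the risk-free leg)

for name, w in [("naive", w_naive), ("plug-in", w_plugin)]:
    sharpe = (w @ mu_true) / np.sqrt(w @ cov_true @ w)
    print(f"{name:8s} true Sharpe ratio: {sharpe:.3f}")
```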


Two-phase designs for bivariate failure time models using copulas

Author: Yuan, L.
Supervisor: Cook, R.

In health studies, two-phase designs are used to cost-effectively study the relationship between an expensive-to-measure biomarker and a response variable. Failure time data, also known as survival data, arise in many health and epidemiological studies and may involve multiple dependent failure times for each individual. When studying the relationship between a biomarker and bivariate failure times, the dependence between the failure times must also be considered. This essay considers issues in the development of two-phase studies of the relationship between a biomarker and bivariate failure times when copula models are used to accommodate the dependence between the failure times.
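
To illustrate the copula construction itself (not the two-phase design developed in the essay), the sketch below generates dependent bivariate failure times from a Clayton copula with exponential margins; the dependence parameter and hazard rates are arbitrary.

```python
# Dependent bivariate failure times from a Clayton copula with exponential margins.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(11)
theta = 2.0                # Clayton parameter; Kendall's tau = theta / (theta + 2) = 0.5
n = 10_000
u1 = rng.random(n)
v = rng.random(n)
# conditional inversion for the Clayton copula
u2 = (u1 ** (-theta) * (v ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)

lam1, lam2 = 0.1, 0.2      # marginal hazard rates (placeholders)
t1 = -np.log(u1) / lam1    # exponential margins via the inverse survival transform
t2 = -np.log(u2) / lam2

# empirical Kendall's tau should be close to theta / (theta + 2) = 0.5
print("Kendall's tau:", round(kendalltau(t1, t2)[0], 3))
```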


Polya Trees for Right-Censored Data

Author: Zhao, Y.
Supervisor: Diao, L.

We estimate the survivor function based on right-censored survival data. Conventionally, the survivor function can be estimated parametrically by assuming that the survival time follows a parametric distribution, e.g., an exponential or Weibull distribution, and estimating the parameters indexing the distribution by maximum likelihood. Alternatively, it can be estimated nonparametrically using the Kaplan-Meier (Kaplan & Meier, 1958) estimator. Parametric methods are efficient under a correctly specified model but lead to biased and invalid results if the distribution of the survival time is misspecified; nonparametric methods are robust and free of the risk of misspecification but subject to a loss of efficiency. The proposed Polya tree approach strikes a balance between the two. Polya trees are a convenient tool frequently adopted in the nonparametric Bayesian literature to solve a variety of problems. Muliere & Walker (1997) constructed Polya trees for right-censored data, designed a tree structure that depends on the observed data, and introduced priors that take partition length into consideration. Neath (2003) built a tree structure that does not depend on the data and modeled the data using a mixture of Polya trees. We introduce a probability allocation method that can work with either data-dependent or data-independent partitions. We conduct intensive simulation studies to assess the performance of the proposed method. The proposed Polya trees improve upon, or perform as well as, the Polya trees proposed by Muliere & Walker (1997).
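
As a reference point for the nonparametric benchmark mentioned above, a bare-bones Kaplan-Meier estimator is sketched below on simulated right-censored data; the failure and censoring distributions are placeholders.

```python
# Minimal Kaplan-Meier estimator for right-censored data.
import numpy as np

def kaplan_meier(time, event):
    """Return the distinct event times and the estimated survivor function S(t)."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    t_uniq = np.unique(time[event == 1])
    surv, s = [], 1.0
    for t in t_uniq:
        at_risk = np.sum(time >= t)                   # subjects still under observation at t-
        d = np.sum((time == t) & (event == 1))        # events at t
        s *= 1.0 - d / at_risk
        surv.append(s)
    return t_uniq, np.array(surv)

# toy data: exponential failure times with independent exponential censoring
rng = np.random.default_rng(5)
t_fail = rng.exponential(10, size=200)
t_cens = rng.exponential(15, size=200)
time = np.minimum(t_fail, t_cens)
event = (t_fail <= t_cens).astype(int)

t, s = kaplan_meier(time, event)
print("S(5) estimate:", s[np.searchsorted(t, 5.0) - 1].round(3), " true:", round(np.exp(-5 / 10), 3))
```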


Measurement System Comparison using the Probability of Agreement Method with Assumption Violations

Author: Chan, B.
Supervisors: Stevens, N., Steiner, S.

In clinical and industry settings, we are often interested in assessing whether a new measurement system may be used interchangeably with an existing one. The probability of agreement method quantifies the agreement between two measurement systems, relying on assumptions of normality and homoscedasticity. However, these assumptions are often violated in practice. In this paper, we discuss the heteroscedastic probability of agreement method proposed by Stevens et al. (2018), and explore the probability of agreement method adapted for log-transformed data as an alternate approach to addressing assumption violations. We compare and contrast the two approaches theoretically and empirically through a case study of skewed heteroscedastic data.
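
As background, the quantity at the heart of the method can be written down in a few lines: under a simple normal model, the probability of agreement compares the difference between the two systems' measurements of the same item to a clinically acceptable difference delta. The sketch below evaluates that probability for hypothetical parameter values; it is a simplified illustration, not the heteroscedastic extension of Stevens et al. (2018).

```python
# Probability of agreement under a basic normal model: the two systems measure the
# same item with possible bias and independent errors; parameter values are hypothetical.
from scipy.stats import norm

def prob_agreement(bias, sd1, sd2, delta):
    """P(|Y1 - Y2| <= delta) when Y1 - Y2 ~ N(bias, sd1^2 + sd2^2)."""
    sd_diff = (sd1 ** 2 + sd2 ** 2) ** 0.5
    return norm.cdf((delta - bias) / sd_diff) - norm.cdf((-delta - bias) / sd_diff)

print(round(prob_agreement(bias=0.5, sd1=1.0, sd2=1.2, delta=3.0), 3))   # about 0.93
```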


Investigating the Performance of Direct and Indirect Causal Effect Estimators under Partial Interference and Structured Nearest Neighbour Interference

Author: Malnowski, V.
Supervisor: McGee, G.

In the framework of causal inference, interference occurs when one subject's treatment has a causal effect on another subject's potential outcomes. This indirect causal effect has shifted from being viewed as a nuisance in the past to being the primary causal effect of interest in many contexts. This paper outlines the methods proposed by Tchetgen and VanderWeele (2012) to estimate the population average direct, indirect, total, and overall causal effects and to quantify their uncertainty in data exhibiting stratified partial interference, using Hájek-style IPW point estimators and sandwich-form variance estimators. We then conduct a simulation study demonstrating that these estimators consistently and efficiently estimate indirect causal effects not only in stratified partial interference settings, but also in data generated under structured nearest neighbour interference. We then apply the outlined methods and simulation study results to an agronomy dataset, where we answer a relevant question from the literature, namely whether one crop's emergence date has a causal effect on another crop's grain yield, by simultaneously testing for stratified partial interference and structured nearest neighbour interference.


Generative Methods for Causal Inference

Author: Zheng, S.
Supervisor: Diao, L.

Estimating the causal effect of an intervention is important in many fields, including education, marketing, health care, political science and online advertising. Causal inference is a field of study that focuses on understanding the cause-and-effect relationships between variables. Causal effects can be estimated from both randomized controlled trials and observational studies. While randomized controlled trials are considered the gold standard for establishing causality, observational studies often require more sophisticated statistical methods to account for potential biases and confounding factors. A core concept in analyzing observational data is the notion of counterfactuals: what would have happened to the same subject if they had been exposed to a different condition. In this essay, our discussion is also set within the counterfactual framework. We study advancements in integrating causal inference with deep learning, focusing on two prominent models: the Causal Effect Variational Autoencoder (CEVAE) and the Generative Adversarial Network for Inference of Treatment Effects (GANITE). We compare these two models through the analysis of two real datasets and find that GANITE consistently outperforms CEVAE in terms of performance metrics. Both CEVAE and GANITE exhibit areas for improvement. Future research should aim to combine the strengths of both models to develop more precise and robust approaches for causal inference, addressing the identified challenges and enhancing the accuracy of the methods.


Robustness and Efficiency Considerations when Testing Process Reliability with a Limit of Detection

Author: Bumbulis, L.
Supervisor: Cook, R.

Processes in biotechnology are considered reliable if they produce samples satisfying regulatory benchmarks. For example, laboratories may be required to show that levels of an undesirable analyte rarely (e.g. in less than 5% of samples) exceed a tolerance threshold. This can be challenging when measurement systems feature a lower limit of detection rendering some observations left-censored. In this paper we discuss the implications of detection limits for location-scale model-based inference in reliability studies, including their impact on large and finite sample properties of various estimators; power of tests for reliability and goodness of fit; and sensitivity of results to model misspecification. To improve robustness we then examine other approaches, including restricting attention to values above the limit of detection and using methods based on left-truncation, exact binomial tests, and a weakly parametric method where the right tail of the response distribution is approximated using a piecewise constant hazard model. This is followed by simulations to inform sample size selection in future reliability studies and an application to a study of residual white blood cell levels in transfusable blood products. We conclude with a brief discussion of our findings and some areas for future work.
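
To show the basic likelihood structure behind a limit of detection, the sketch below fits a location-scale (normal) model by maximum likelihood when some observations are only known to fall below the detection limit: detected values contribute a density term and non-detects a cumulative-probability term. This is a generic sketch with simulated data, not the specific reliability analysis in the paper.

```python
# MLE for a normal location-scale model with left-censoring at a lower limit of detection.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
mu_true, sigma_true, lod = 1.0, 0.8, 0.7
y = rng.normal(mu_true, sigma_true, size=300)
detected = y > lod                               # values below the LOD are only known to be <= lod

def neg_loglik(par):
    mu, log_sigma = par
    sigma = np.exp(log_sigma)
    ll_obs = norm.logpdf(y[detected], mu, sigma).sum()          # detected observations
    n_cens = (~detected).sum()
    ll_cens = n_cens * norm.logcdf((lod - mu) / sigma)          # left-censored observations
    return -(ll_obs + ll_cens)

fit = minimize(neg_loglik, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```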


Distribution of L1 distance in the unit hypercube

Author: Gajendragadkar, R.
Supervisor: Drekic, S.

We derive the exact distribution of the L_1 distance between two points sampled uniformly at random from an n-dimensional unit cube, and propose a hypothesis test based on this distribution for detecting dependence between the columns of a random matrix. Finally, we generalize the distribution of each coordinate of the sampled points from uniform to a Beta(a, b) distribution and conjecture an asymptotic result. Several comparative plots are provided to demonstrate the obtained results.
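
For intuition, the L_1 distance between two uniform points in the unit cube is a sum of n i.i.d. terms |U - V|, each with mean 1/3 and variance 1/18, so a normal approximation should improve as n grows. The quick Monte Carlo check below compares simulated probabilities with that approximation; it is not the exact derivation given in the paper, and the dimension and cutoff are arbitrary.

```python
# Monte Carlo check of the L1 distance in [0, 1]^n against a normal approximation.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps = 20, 100_000
d = np.abs(rng.random((reps, n)) - rng.random((reps, n))).sum(axis=1)

mean_approx, sd_approx = n / 3.0, np.sqrt(n / 18.0)   # per-coordinate mean 1/3, variance 1/18
q = 8.0
print("simulated P(D <= 8):", np.mean(d <= q).round(4))
print("normal approx      :", norm.cdf((q - mean_approx) / sd_approx).round(4))
```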


A Comparison between Joint Modeling and Landmark Modeling for Dynamic Prediction

Author: An, S.
Supervisor: Dubin, J.

This research essay presents a comprehensive comparison between Joint Modeling (JM) and Landmark Modeling (LM) approaches for dynamic prediction in longitudinal data analysis. The study utilizes simulation studies and real-world data applications to evaluate the predictive performance of both models. The JM approach integrates a linear mixed-effects model for longitudinal biomarker measurements with a Cox proportional hazards model for survival data, providing a robust framework for dynamic predictions. In contrast, the LM approach updates prediction models at key time points using the latest longitudinal data, offering flexibility in handling time-varying covariates. Simulation results indicate that JM generally outperforms LM in predictive accuracy, particularly under conditions of high residual variance and long prediction horizons. However, LM demonstrates strengths in handling irregular measurement times and integrating short-term event information. The application to the Prothros dataset, involving patients with liver cirrhosis, illustrates the practical implications of both models. It highlights JM’s superior performance in the early years and LM’s variability in later years. This study underscores the importance of selecting appropriate models based on specific data characteristics and predictive goals. It suggests avenues for future research in non-linear trajectories and multi-biomarker integration to further enhance dynamic prediction methodologies.



Diagnostic test accuracy meta-analysis based on exact within-study variance estimation method

Author: Dabi, O.
Supervisor: Negeri, Z. 

A meta-analysis of diagnostic test accuracy (DTA) studies commonly synthesizes study-specific test sensitivity (Se) and test specificity (Sp) from different studies that aim to quantify the screening or diagnostic performance of a common index test of interest. A bivariate random effects model that utilizes the logit transformation of Se and Sp and accounts for the within-study and between-study heterogeneity is commonly used to make statistical inferences about the unknown test characteristics. However, it is well reported that this model may lead to misleading inference since it employs the logit transformation and an approximate within-study variance estimate. Alternative transformations which do not require continuity corrections, such as the arcsine square root and the Freeman-Tukey double arcsine, were recently proposed to overcome the former limitation. However, these solutions also suffer from using approximate within-study variance estimates, which can only be justified when within-study sample sizes are large. To overcome these problems, we propose an exact within-study variance estimation approach which does not require a continuity correction and is invariant to transformations. We evaluate the proposed method compared to the existing approach using real-life and simulated meta-analyses of DTA data. Our findings indicate that both methods perform comparably when there are no zero cell counts in the DTA data and the sample sizes (the numbers of diseased and non-diseased individuals) per study are large. The approximate method significantly underestimates the summary Se and Sp, especially when the true Se and Sp pairs are closer to 1. However, the analytical method exhibits better bias, root mean squared error (RMSE), confidence interval (CI) width, and coverage probability for Se and Sp when the true Se and Sp are large. Similar results are found when comparing the methods in terms of the between-study variance-covariance parameters. Therefore, researchers and practitioners can use either of the within-study variance estimation methods for aggregate data meta-analysis (ADMA) of DTA studies without zero cell counts and with large within-study sample sizes. Conversely, the analytical method should be preferred over the approximate technique for ADMA of DTA studies with zero cell counts or small within-study sample sizes.


Imputation Approach to Missing Data and Causal Inference

Author: Huang, Z.
Supervisor: Wu, C.

We provide a critical review of imputation approaches to missing data analysis and causal inference. We present general settings and methodologies for each topic, discuss key assumptions for the validity of the methods, and highlight the connections and common features of these two seemingly distinct areas under a unified framework for imputation-based methods. Our simulation studies substantiate the practicality of applying imputation techniques originally developed for missing data to estimate average treatment effects in causal inference, demonstrating their effectiveness and versatility.
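
The connection can be made concrete with a few lines of code: treating each unobserved potential outcome as a missing value, regression imputation within each treatment arm yields an estimate of the average treatment effect under unconfoundedness. The data-generating model and learners below are stylized placeholders, not the paper's simulation design.

```python
# Regression imputation of the missing potential outcome to estimate the ATE.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 2000
x = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(x[:, 0] - 0.5 * x[:, 1])))                # confounded treatment assignment
a = rng.binomial(1, p)
y = 1.0 * a + x @ np.array([1.0, -1.0]) + rng.normal(size=n)    # true ATE = 1

m1 = LinearRegression().fit(x[a == 1], y[a == 1])
m0 = LinearRegression().fit(x[a == 0], y[a == 0])
y1 = np.where(a == 1, y, m1.predict(x))                         # impute the unobserved potential outcomes
y0 = np.where(a == 0, y, m0.predict(x))
print("imputation-based ATE estimate:", round(np.mean(y1 - y0), 3))
```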


A Review and Comparison of Multiple Testing Procedures

Author: Wu, R. P.
Supervisor: Stevens, N.

This paper provides a broad comparison and review of various multiple testing procedures, ranging from classical methods such as the Bonferroni correction, Holm’s procedure, Hommel’s procedure, and Hochberg’s procedure, to more recent methods like the PAAS procedure, the fallback procedure, and the 4A procedure. We evaluate the performance of these procedures through simulation studies, considering various levels of marginal power, correlation, and the number of well-powered and under-powered endpoints. Our simulation results reveal that although the Bonferroni correction is overly conservative, the practical difference in its empirical power compared to the other methods is small in many settings. However, more advanced procedures, such as the fallback and 4A procedures, achieve higher empirical power at the cost of simplicity and interpretability. In contrast, the General Multistage Gatekeeping (GMG) procedure, which groups hypotheses into families based on criteria such as endpoint importance, demonstrates lower empirical power compared to the other methods in the context of multiple endpoints. The results and insights gleaned from this paper underscore the importance of choosing an appropriate multiple testing procedure based on the specific use case. Our findings suggest that while more advanced methods can ensure control over the family-wise error rate, they introduce an added layer of complexity both in application and interpretation. This paper aims to serve as a guideline and ‘play-book’ for researchers and industry professionals in selecting the right multiple testing procedure for their respective circumstances.
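
To illustrate the classical baselines, the snippet below applies the Bonferroni and Holm adjustments to the same vector of p-values; Holm is uniformly at least as powerful while still controlling the family-wise error rate. The p-values are arbitrary illustrative numbers.

```python
# Bonferroni versus Holm on a small set of illustrative p-values.
import numpy as np

def bonferroni(pvals, alpha=0.05):
    return np.asarray(pvals) <= alpha / len(pvals)

def holm(pvals, alpha=0.05):
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):            # step-down: stop at the first failure
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

pvals = [0.001, 0.012, 0.016, 0.04, 0.20]
print("Bonferroni rejects:", bonferroni(pvals))   # only the smallest p-value
print("Holm rejects      :", holm(pvals))         # the three smallest p-values
```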


dWOLS precision medicine implementation with measurement error from treatment non-adherence

Author: Mawer, K.
Supervisor: Wallace, M.

Precision medicine tailors treatment to a patient's characteristics in order to optimize their response. A dynamic treatment regime (DTR) is a formalized application of precision medicine that incorporates a treatment rule or a series of treatment rules, and we want to find the optimal DTR, the one that maximizes the expected outcome. We use dWOLS, a method of DTR estimation that weights atypical treatment assignments more heavily. Ideally, patients fully adhere to the treatment they are prescribed; however, people may not adhere, which results in measurement error in the treatment variate, as a person's actual treatment will differ from their prescribed treatment. Because people may not adhere due to an unwillingness to endure side effects or due to forgetfulness, we define personality variables relating to openness and conscientiousness, respectively, to capture these mechanisms. For our treatment variate, we assume an experimental treatment and a control treatment, where people who do not adhere are treated as though they received the control treatment. Although measurement error can affect DTR estimation, dWOLS can still be used to model treatment effects when the treatment is measured with error. We simulate the effects of non-adherence to understand the bias and other potential problems, including when the non-adherence depends on the variates used to tailor the treatment. We find that the resulting bias can be rectified in the estimators, with rectification being easier when the non-adherence is independent of the tailoring variate.
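
For readers new to dWOLS, the sketch below implements a single-stage version with one common balancing weight choice from the dWOLS literature, w = |A - pihat(X)|, on fully adherent simulated data; the non-adherence mechanisms and corrections studied in the paper are not reproduced here.

```python
# Single-stage dWOLS sketch: weighted OLS with weights |A - pihat(X)|; the
# treatment-free part of the outcome model is deliberately misspecified to show
# that the blip (treatment-interaction) estimates remain close to the truth.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 5000
x = rng.normal(size=n)
pi = 1 / (1 + np.exp(-x))                                  # treatment depends on the tailoring variate
a = rng.binomial(1, pi)
y = np.exp(x) + a * (1.0 - 1.5 * x) + rng.normal(size=n)   # true blip parameters: (1.0, -1.5)

pihat = sm.GLM(a, sm.add_constant(x), family=sm.families.Binomial()).fit().fittedvalues
w = np.abs(a - pihat)                                      # balancing weights
design = np.column_stack([np.ones(n), x, a, a * x])        # linear treatment-free part (misspecified)
fit = sm.WLS(y, design, weights=w).fit()
print("blip estimates:", fit.params[2:].round(2))          # should be near (1.0, -1.5)
```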


Measurement Error in the Tailoring Covariate and Its Association with Group Membership

Author: Sivathayalan, J.
Supervisor: Wallace, M.

Dynamic treatment regimes provide a framework for providing personalized interventions for a given condition, but their construction relies on error-prone measurements of covariates and treatments. Such measurement error may be associated with individuals' membership in certain groups, such as sociodemographic categories, and therefore affect the adequacy of the treatment(s) they are given. Dynamic weighted ordinary least squares has been established as a doubly robust method of estimation for dynamic treatment regimes, with a relatively straightforward implementation; this paper explores its use in a single-stage regime, with a sample consisting of individuals from groups measured with varying amounts of error. Simulation results show better accuracy in treatment assignment for those in a group measured with less error, and that greater sample size and a lower magnitude of error generally lead to improved accuracy. They demonstrate the impact of measurement error in a setting where its effect varies based on group membership, and how this can affect the quality of treatment received.


Spatio-temporal Data Analysis

Author: Ge, R.
Supervisor: Dubin, J.

The paper explores the challenges and methodologies involved in analyzing spatio-temporal data, which is increasingly generated from various sources such as remote sensing, mobility data, wearable devices, and social media. Spatio-temporal data, characterized by its spatial and temporal components, requires sophisticated analytical methods due to its complexity and the inherent spatial autocorrelation. Significant challenges in spatiotemporal data analysis include handling errors from missing observations, systematic biases, and measurement inaccuracies. The integration of spatial and temporal database models into unified spatio-temporal models has been a focus of recent research, aiming to improve the practical application and development of these models.

Bayesian hierarchical models are emphasized for their ability to incorporate time and area effects, providing insights through the interpretability of neighborhood structures and adjacent times. However, these models traditionally rely on Markov Chain Monte Carlo methods, which are computationally intensive. This research essay presents the Integrated Nested Laplace Approximation (INLA) as a computationally efficient alternative for Bayesian analysis, especially suitable for latent Gaussian models. INLA offers significant computational advantages, providing precise estimates in a fraction of the time required by traditional methods. Additionally, the paper discusses the application of Generalized Linear Mixed Effects Models (GLMMs), which have gained popularity for modeling spatio-temporal data due to their flexibility in handling different types of data and accounting for spatial random effects. The GLMM framework is capable of capturing the correlation between observations over time and space, making it a valuable tool for spatio-temporal data analysis. In this essay, I highlight the potential of Bayesian hierarchical models and GLMMs, alongside computational advancements like INLA, to enhance the accuracy and efficiency of spatiotemporal data analysis. The study suggests avenues for methodological refinement and emphasizes the need for careful prior selection to ensure reliable estimates in practical applications. Future research should incorporate real-world data and explore more complex spatial-temporal correlations to enhance the applicability and robustness of INLA models.


Sequential Tennis: A Tennis Engine for Coaching, Commentary, Evaluation, and Simulation

Author: Wang, C. 
Supervisor: Drekic, S.

This research paper introduces a novel method to model and simulate tennis rallies, using the sequential nature of tennis to abstract a rally into a game tree while preserving key components such as shot trajectories, hitting windows, player movement speed, and shot risk. The resulting game complexity is estimated with respect to various metrics and compared to other sequential games such as chess and Go. After the model is constructed, a modified version of the negamax algorithm incorporating risk can be applied to obtain an engine which evaluates player decision making and recommends optimal strategies and tactics. This is demonstrated with a case study of how a rally can develop when a server employs a kick serve, showcasing potential applications in coaching, match commentary, player evaluation, and simulation for game development and match prediction.
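
Since the engine rests on the negamax recursion, a generic version of the algorithm is sketched below on an abstract game tree, with a simple risk penalty folded into the evaluation; the tennis-specific state, shot options, evaluation values, and risk model are placeholders, not the engine described above.

```python
# Generic negamax over an abstract game tree with a risk penalty on each move.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    value: Optional[float] = None                                      # terminal evaluation for the player to move
    moves: List[Tuple["Node", float]] = field(default_factory=list)    # (child node, risk of the shot)

def negamax(node: Node, risk_weight: float = 0.1) -> float:
    if node.value is not None:
        return node.value
    # the opponent moves next, so each child's value is negated; risk is penalized
    return max(-negamax(child, risk_weight) - risk_weight * risk
               for child, risk in node.moves)

# tiny example: the aggressive shot scores better ignoring risk, but the risk
# penalty flips the decision toward the safer option
aggressive_child = Node(value=-0.6)   # negated to +0.6 for the player at the root
safe_child = Node(value=-0.4)
root = Node(moves=[(aggressive_child, 0.8), (safe_child, 0.1)])
print("risk_weight = 0.0:", round(negamax(root, risk_weight=0.0), 2))   # 0.60 (aggressive shot chosen)
print("risk_weight = 0.3:", round(negamax(root, risk_weight=0.3), 2))   # 0.37 (safe shot chosen)
```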


Analysis of Multi-Server Queuing Systems with Batch Arrivals: Applications in Insurance Claim Processing

Author: Zhang, Z.
Supervisor: Drekic, S.

This report presents a comprehensive analysis of a multi-server queuing system with batch arrivals, focusing on its application in insurance claim processing. We investigate a model where claims arrive in batches at regular intervals and are processed by adjusters with exponentially distributed service times. The study covers general, single-server, and infinite-server cases, deriving expressions for unprocessed claims per batch and total unprocessed claims. A key finding reveals that under the single-server case, when batch sizes follow a Discrete Phase-type (DPH) distribution, unprocessed claims per batch also follows a DPH distribution, enabling the application of matrix-analytic methods. We explore system behavior under various conditions, discuss practical implications for insurance claim processing, and address computational aspects for large-scale systems. The analysis provides insights into resource allocation, system efficiency, and performance optimization. The report concludes by identifying areas for future research, contributing to the broader understanding of batch processing in queuing theory and its real-world applications.
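
A small discrete-event simulation conveys the setup: batches of claims arrive at fixed intervals, are served by c adjusters with exponential service times, and we record how many claims remain unprocessed just before each new batch lands. The batch-size distribution, number of adjusters, and rates below are arbitrary placeholders, and the code is a sketch of the model class rather than the matrix-analytic solution derived in the report.

```python
# Simulating batch arrivals at regular intervals into a c-server exponential queue.
import heapq
import numpy as np

rng = np.random.default_rng(9)
c, mu, interval, n_batches = 3, 1.0, 1.0, 10_000     # adjusters, service rate, batch spacing
batch_sizes = rng.poisson(1.5, size=n_batches) + 1   # batch-size distribution (placeholder)

free_at = [0.0] * c                                  # next time each adjuster becomes free
heapq.heapify(free_at)
pending = []                                         # departure times of claims not yet finished
unprocessed_before_batch = []

for b in range(n_batches):
    t = b * interval
    pending = [d for d in pending if d > t]          # claims still unfinished at time t
    unprocessed_before_batch.append(len(pending))
    for _ in range(batch_sizes[b]):
        start = max(t, heapq.heappop(free_at))       # claim starts when an adjuster frees up (FCFS)
        finish = start + rng.exponential(1.0 / mu)
        heapq.heappush(free_at, finish)
        pending.append(finish)

print("mean unprocessed claims just before a batch:", round(np.mean(unprocessed_before_batch), 2))
```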


GAM SymbolicGPT: A Generalized Additive Model Approach to SymbolicGPT

Author: Zhang, D.
Supervisor: Ghodsi, A.

Symbolic regression is the process of deriving a mathematical expression that best describes the underlying relationship between a set of input and output variables. While deep learning-based approaches have achieved significant success in this domain, they often struggle with high-dimensional input data due to the immense search space. In this work, to address the issue of high dimensionality, we extend an existing deep learning method, SymbolicGPT, by proposing a novel algorithm, GAM SymbolicGPT, inspired by the backfitting algorithm used for fitting generalized additive models. Through experimentation, we highlight the limitations of our method in addressing high-dimensional symbolic regression tasks.
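
As background on the inspiration, the sketch below runs the classical backfitting loop for an additive model y = f1(x1) + ... + fp(xp) + noise, cycling through the components and smoothing partial residuals; the crude running-mean smoother and simulated data are stand-ins, and this is not the GAM SymbolicGPT method itself.

```python
# Classical backfitting for an additive model with a simple moving-average smoother.
import numpy as np

def smooth(x, r, window=31):
    """Crude moving-average smoother of residuals r against covariate x."""
    order = np.argsort(x)
    kernel = np.ones(window) / window
    smoothed = np.convolve(r[order], kernel, mode="same")
    out = np.empty_like(r)
    out[order] = smoothed
    return out

rng = np.random.default_rng(6)
n, p = 2000, 3
X = rng.uniform(-2, 2, size=(n, p))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(0, 0.3, n)

f = np.zeros((p, n))
alpha = y.mean()
for _ in range(20):                                    # backfitting iterations
    for j in range(p):
        partial = y - alpha - f.sum(axis=0) + f[j]     # partial residual excluding component j
        f[j] = smooth(X[:, j], partial)
        f[j] -= f[j].mean()                            # center each component for identifiability

resid = y - alpha - f.sum(axis=0)
print("residual SD after backfitting:", round(resid.std(), 3))   # should approach the noise SD (0.3)
```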


Meta-Modeling for Fair Fee Determination in Registered Index-Linked Annuities (RILAs)

Author: Quan, H.
Supervisor: Feng, B.

This paper introduces a novel approach to analyzing Registered Index-Linked Annuities (RILAs), an emerging financial product that blends the features of Variable Annuities (VAs) with simpler characteristics. This unique combination facilitates the application of advanced meta-modeling techniques. We developed a comprehensive simulation model to evaluate RILA performance, drawing on a compressor-simulator-predictor structure. The model incorporates diverse factors such as smoking behavior, residency, and age, and we adapt the objective for enhanced industry applicability. First, we propose a regression-based method to determine the fair fees used to label RILA contracts for predictor use. Second, we utilize various predictors to compare and assess the model’s performance.


Differential Privacy: a Survey and Review

Author: Qin, Y. 
Supervisor: Chenouri, S.

In an era where data privacy has become increasingly crucial, differential privacy emerges as a leading framework for safeguarding individual information while enabling the analysis of large datasets. This paper presents a brief survey and review of differential privacy, exploring its foundational principles, key mechanisms, and practical applications. We examine the theoretical underpinnings of differential privacy, including the central and local models, and discuss the trade-offs between privacy guarantees and data utility. Through a brief introduction of common techniques such as the Laplace and Gaussian mechanisms, we highlight their effectiveness in various contexts and their flexibility and versatility in statistical analysis. Drawing on insights from existing literature, we also briefly discuss future directions for differential privacy research.
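
As a concrete example of the mechanisms mentioned above, the Laplace mechanism releases a query answer with epsilon-differential privacy by adding Laplace noise scaled to (sensitivity / epsilon); a counting query has sensitivity 1 because adding or removing one person changes it by at most 1. The query answer and epsilon values below are hypothetical.

```python
# The Laplace mechanism for an epsilon-differentially private count release.
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(12)
true_count = 4821                     # hypothetical query answer
for eps in (0.1, 1.0, 10.0):          # smaller epsilon = stronger privacy = noisier release
    noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=eps, rng=rng)
    print(f"epsilon = {eps:>4}:  released count = {noisy:.1f}")
```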


Investigating the Performance and Parametrization of a Multiscale Spike Train Model

Author: Das, K.
Supervisor: Ramezan, R.

The brain, as the central organ of the nervous system, controls various complex processes through the activity of neurons, which communicate via sequences of consecutive action potentials called spike trains. This paper evaluates the performance of estimation algorithms and software for two multiscale models, proposed by Ramezan and colleagues in 2014, for the intensity function of an inhomogeneous Poisson process for neural spike trains. Through simulations, we focus on the recovery of known multiscale intensity functions with one or two periodic components, and we also address the dimensionality issues of these models. Simulation results demonstrate that while the smoothed periodogram effectively identifies the original frequency values and initial phases, challenges arise in accurately estimating the models’ parameters. Significant trial-to-trial variability indicates that the models struggle to provide low-variance parameter estimates across trials. Studying the Fisher information matrix, we observe “practical unidentifiability” in these models, which arises when the log-likelihood is theoretically curved but the curvature is too small for meaningful joint inference about the parameters.
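
To illustrate the kind of process being fitted, the sketch below simulates a spike train from an inhomogeneous Poisson process by thinning, using a single periodic intensity component; the intensity form and parameter values are generic stand-ins, not the multiscale model of Ramezan and colleagues.

```python
# Simulating an inhomogeneous Poisson spike train by thinning, with a periodic intensity.
import numpy as np

rng = np.random.default_rng(10)
T = 10.0                                           # duration in seconds
base, amp, freq, phase = 20.0, 10.0, 8.0, 0.0      # Hz-scale placeholders

def intensity(t):
    return base + amp * np.cos(2 * np.pi * freq * t + phase)

lam_max = base + amp                               # upper bound on the intensity for thinning
n_cand = rng.poisson(lam_max * T)
cand = np.sort(rng.uniform(0, T, size=n_cand))     # candidate events from a homogeneous process
spikes = cand[rng.random(n_cand) < intensity(cand) / lam_max]   # accept with prob lambda(t)/lam_max

print(f"{spikes.size} spikes in {T:.0f} s  (expected about {base * T:.0f})")
```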


Mimic Modelling of Expected Goals in Soccer Analysis

Author: Owusu Boateng, B.
Supervisor: Davis, M.J.

Expected goals (xG) have become a key metric in modern soccer analytics, providing a reliable way to evaluate the quality of scoring opportunities. This study investigates the factors that influence xG using shot location data from the 2014 La Liga season. By applying a mixed-effects modeling approach, we analyze how factors such as shot distance, shot type, game situation, and the influence of individual players and teams affect the probability of scoring. The dataset includes approximately 90,000 shot attempts, capturing detailed information on shot coordinates, shot type (e.g., right foot, header, penalty), game context (e.g., open play, set pieces), and player and team identifiers. To address the complexity of soccer, mixed-effects models were employed to account for both fixed effects—such as shot distance and game situations—and random effects tied to players and teams, which capture variability at both individual and team levels. The results show that shot distance has a strong inverse relationship with scoring probability, with shots taken closer to the goal resulting in higher xG values. Certain shot types, such as penalties, significantly increase xG, while headers exhibit greater variability in their effectiveness. Game situations, such as open play versus set pieces, also reveal distinct patterns in scoring likelihood, reflecting differences in tactical approach. The inclusion of random effects highlights the importance of player and team-specific factors, indicating that individual skill and team strategy play crucial roles in determining shot outcomes. This study demonstrates the value of mixed-effects models in providing a detailed understanding of the factors that influence scoring in soccer. By accounting for both fixed and random effects, the model offers a thorough analysis of shot effectiveness and variability across players and teams. These findings have practical applications for coaches and analysts seeking to refine training methods and optimize tactics for winning soccer matches. The research also suggests opportunities for further exploration, particularly by expanding the dataset to multiple seasons or incorporating additional factors such as player positioning, opponent pressure, and match dynamics to enhance the predictive accuracy of xG models.


Examining Violations of the Poisson Process Assumption in Ice Hockey Goals

Author: Uchendu, C.
Supervisor: Davis, M.J.

The Poisson process is a foundational framework for modeling event timings, but its assumptions of a constant rate and independence are often violated in the dynamic context of ice hockey goal scoring. This study critically examines these violations using play-by-play data from the 2022–2023 NHL season. By analyzing inter-goal times (the time between goals) across different game periods and the entire game, the study evaluates alternative statistical models, including the exponential, mixture of exponentials, gamma, generalized gamma, and beta distributions, to address the limitations of the Poisson process. These models were fitted using maximum likelihood estimation (MLE) and evaluated with model selection criteria (AIC and BIC). Censoring adjustments were implemented to account for truncated data due to fixed game durations. Results indicate that the Poisson process fails to adequately describe goal intervals, with significant deviations observed across periods. Among the tested models, the (censored) generalized gamma distribution consistently outperformed the others, capturing the heavy-tailed and bursty nature of goal-scoring intervals. The beta distribution also showed strong performance in modeling time-bounded periods. These findings highlight the need for more flexible models in hockey analytics to account for variability and dependencies in goal scoring. Practical implications include applications in coaching strategies and sports betting, where accurate modeling of scoring dynamics can provide a strategic advantage. Future research should incorporate contextual and spatial-temporal factors to further enhance model precision.
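
To show the model-comparison step in miniature, the snippet below fits an exponential and a generalized gamma distribution to simulated inter-goal times and compares AICs. For simplicity it ignores the censoring adjustment used in the study and treats all intervals as fully observed; the simulated gap times and parameter values are placeholders.

```python
# Fitting candidate inter-goal-time distributions by maximum likelihood and comparing AICs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
gaps = stats.gengamma(a=0.6, c=1.4, scale=400).rvs(size=800, random_state=rng)  # stand-in gap times (seconds)

for name, dist in [("exponential", stats.expon), ("generalized gamma", stats.gengamma)]:
    params = dist.fit(gaps, floc=0)                      # fix the location parameter at zero
    loglik = dist.logpdf(gaps, *params).sum()
    k = len(params) - 1                                  # location is fixed, not estimated
    print(f"{name:18s} AIC = {2 * k - 2 * loglik:.1f}")  # lower is better
```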