Current students

While a high school student in Toronto, Maysum Panju knew that the University of Waterloo was the destination for math. While working on his undergraduate degree and Master’s in computational math, he started learning about machine learning. That led to his interest in developing algorithms and theoretical proofs, and he decided to start his PhD in statistics.

University of Waterloo Faculty of Mathematics researchers have developed a new method that enables large insurers to reduce the time spent estimating the financial liabilities of their portfolios from days to hours while achieving high accuracy.

A study details the new method, which significantly reduces computational time yet still estimates the financial liability of variable annuity portfolios accurately for business purposes.

Statistics and Actuarial Science PhD candidate Yilin Chen is one of two students to claim the 2020 Huawei Prize for Best Research Paper by a Mathematics Graduate Student. The $4,000 prize affirms the value of Chen’s efforts to establish a framework for analyzing nonprobability survey samples in her winning paper: Doubly Robust Inference with Nonprobability Survey Samples.

Friday, February 7, 2020 10:00 am - 10:00 am EST (GMT -05:00)

Department seminar by Gabriel Becker, University of California Davis

The Extended Reproducibility Phenotype - Re-framing and Generalizing Computational Reproducibility

Computational reproducibility has become a crucial part of how data analytic results are understood and assessed both in and outside of academia. Less work, however, has explored whether these strict computational reproducibility criteria are necessary or sufficient to actually meet our needs as consumers of analysis results. I will show that in principle they are neither. I will present two inter-related veins of work. First, I will provide a conceptual reframing of the concepts of strict reproducibility, and the actions analysts take to ensure it, in terms of our ability to actually trust the results and the claims about the underlying data-generating systems they embody. Second, I will present a generalized conception of reproducibility by introducing the concepts of Currency, Comparability and Completeness and their oft-overlooked importance to assessing data analysis results.

Thursday, February 6, 2020 10:00 am - 10:00 am EST (GMT -05:00)

Department seminar by Liqun Diao, University of Waterloo

Censoring Unbiased Regression Trees and Ensembles

Tree-based methods are useful tools to identify risk groups and conduct prediction by employing recursive partitioning to separate subjects into different risk groups. We propose a novel paradigm of building regression trees for censored data in survival analysis. We prudently construct the censored-data loss function through an extension of the theory of censoring unbiased transformations. With this construction, we can conveniently implement the proposed regression tree algorithm using existing software for the Classification and Regression Trees algorithm (e.g., the rpart package in R) and extend it for ensemble learning. Simulations and real data examples demonstrate that our methods either improve upon or remain competitive with existing tree-based algorithms for censored data.

Wednesday, February 5, 2020 10:00 am - 10:00 am EST (GMT -05:00)

Department seminar by David Kepplinger, University of British Columbia

Detecting the Signal Among Noise and Contamination in High Dimensions

Improvements in biomedical technology and a surge in other data-driven sciences lead to the collection of increasingly large amounts of data. In this affluence of data, contamination is ubiquitous but often neglected, creating substantial risk of spurious scientific discoveries. Especially in applications with high-dimensional data, for instance proteomic biomarker discovery, the impact of contamination on methods for variable selection and estimation can be profound yet difficult to diagnose.

In this talk I present a method for variable selection and estimation in high-dimensional linear regression models, leveraging the elastic-net penalty for complex data structures. The method is capable of harnessing the collected information even in the presence of arbitrary contamination in the response and the predictors. I showcase the method’s theoretical and practical advantages, specifically in applications with heavy-tailed errors and limited control over the data. I outline efficient algorithms to tackle computational challenges posed by inherently non-convex objective functions of robust estimators and practical strategies for hyper-parameter selection, ensuring scalability of the method and applicability to a wide range of problems.

Thursday, January 30, 2020 10:00 am - 10:00 am EST (GMT -05:00)

Department seminar by Hyukjun (Jay) Gweon, Western University

Batch-mode active learning for regression and its application to the valuation of large variable annuity portfolios

Supervised learning algorithms require a sufficient amount of labeled data to construct an accurate predictive model. In practice, collecting labeled data may be extremely time-consuming while unlabeled data can be easily accessed. In a situation where labeled data are insufficient for a prediction model to perform well and the budget for additional data collection is limited, it is important to effectively select objects to be labeled based on whether they contribute to a great improvement in the model's performance. In this talk, I will focus on the idea of active learning that aims to train an accurate prediction model with minimum labeling cost. In particular, I will present batch-mode active learning for regression problems. Based on random forest, I will propose two effective random sampling algorithms that consider the prediction ambiguities and diversities of unlabeled objects as measures of their informativeness. Empirical results on an insurance data set demonstrate the effectiveness of the proposed approaches in valuing large variable annuity portfolios (which is a practical problem in the actuarial field). Additionally, comparisons with the existing framework that relies on a sequential combination of unsupervised and supervised learning algorithms are also presented.

Friday, January 24, 2020 10:00 am - 10:00 am EST (GMT -05:00)

Department seminar by Michael Gallaugher, McMaster University

Clustering and Classification of Three-Way Data

Clustering and classification is the process of finding and analyzing underlying group structure in heterogeneous data and is fundamental to computational statistics and machine learning. In the past, relatively simple techniques could be used for clustering; however, with data becoming increasingly complex, these methods are oftentimes not advisable, and in some cases not possible. One such example is the analysis of three-way data, where each data point is represented as a matrix instead of a traditional vector. Examples of three-way data include greyscale images and multivariate longitudinal data. In this talk, recent methods for clustering three-way data will be presented, including methods for high-dimensional and skewed three-way data. Both simulated and real data will be used for illustration, and future directions and extensions will be discussed.