Department of Statistics and
Actuarial Science (SAS)
Mathematics 3 (M3)
University of Waterloo
Administrative Staff Directory
Phone: 519-888-4567, ext. 33550
Back to the future: why I think REGRESSION is the new black in genetic association studies
Linear regression remains an important framework in the era of big and complex data. In this talk I present some recent examples where we resort to the classical simple linear regression model and its celebrated extensions in novel settings. The Eureka moment came while reading Wu and Guan's (2015) comments on our generalized Kruskal-Wallis (GKW) test (Elif Acar and Sun 2013, Biometrics). Wu and Guan presented an alternative “rank linear regression model and derived the proposed GKW statistic as a score test statistic”, and astutely pointed out that “the linear model approach makes the derivation more straightforward and transparent, and leads to a simplified and unified approach to the general rank based multi-group comparison problem.” More recently, we turned our attention to extending Levene's variance test for data with group uncertainty and sample correlation. While a direct modification of the original statistic is indeed challenging, I will demonstrate that a two-stage regression framework makes the ensuing development quite straightforward, eventually leading to a generalized joint location-scale test (David Soave and Sun 2017, Biometrics). Finally, I will discuss ongoing work, with graduate student Lin Zhang, on developing an allele-based association test that is robust to the assumption of Hardy-Weinberg equilibrium and is generalizable to complex data structures. The crux of this work is, again, reformulating the problem as a regression!
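The two-stage regression view of Levene's test can be sketched in a few lines. This is a minimal illustration of the classical mean-centred Levene test only, assuming complete data with known group labels. It does not implement the generalized joint location-scale test, and the function names are hypothetical.

```python
from collections import defaultdict


def group_means(values, groups):
    """OLS fit of values on group indicators: the fitted value is the group mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


def levene_two_stage(y, groups):
    """Return the F statistic of the classical (mean-centred) Levene test."""
    # Stage 1: regress y on group and keep the absolute residuals.
    mu = group_means(y, groups)
    abs_resid = [abs(v - mu[g]) for v, g in zip(y, groups)]

    # Stage 2: regress the absolute residuals on group and form the usual F statistic.
    d_bar = group_means(abs_resid, groups)
    n, k = len(y), len(d_bar)
    grand = sum(abs_resid) / n
    ss_between = sum((d_bar[g] - grand) ** 2 for g in groups)
    ss_within = sum((d - d_bar[g]) ** 2 for d, g in zip(abs_resid, groups))
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy data (made up for illustration): three groups with increasing spread.
y = [1, 2, 3, 2, 4, 6, 1, 5, 9]
groups = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
f_stat = levene_two_stage(y, groups)
```

Because both stages are ordinary regressions, each can in principle be swapped for a model that handles uncertain group membership or correlated samples, which is the appeal of the reformulation described in the abstract.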
Probability models for discretization uncertainty with adaptive grid designs for systems of differential equations
When models are defined implicitly by systems of differential equations without a closed form solution, small local errors in finite-dimensional solution approximations can propagate into large deviations from the true underlying state trajectory. Inference for such models relies on a likelihood approximation constructed around a numerical solution, which underestimates posterior uncertainty. This talk will introduce and discuss progress in a new adaptive formalism for modeling and propagating discretization uncertainty through the Bayesian inferential framework, allowing exact inference and uncertainty quantification for discretized differential equation models.
Dual representations of risk measures on Orlicz spaces
The standard theory of risk measures, developed for bounded positions, asserts that any coherent risk measure with the Fatou property can be represented as the worst expectation over a class of probabilities. In this talk, we will discuss possible extensions of this result when the space of financial positions is taken to be an Orlicz space. We show that the representation fails in general and remains valid if the risk measure possesses additional properties (e.g., law-invariance, strong Fatou property).
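For reference, the classical representation for bounded positions can be written as follows. The sign convention (X is a payoff, so losses are -X) and the symbol \mathcal{Q} are assumptions made here for illustration; the abstract does not fix them.

```latex
\rho(X) \;=\; \sup_{Q \in \mathcal{Q}} \mathbb{E}_Q[-X], \qquad X \in L^\infty,
```

Here \mathcal{Q} denotes a set of probability measures that are absolutely continuous with respect to the reference probability.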
Quantile regression with nominated samples for more efficient and less expensive follow-up studies of bone mineral density
We develop a new methodology for analyzing upper and/or lower quantiles of the distribution of bone mineral density using quantile regression. Nomination sampling designs are used to obtain more representative samples from the tails of the underlying distribution. We propose new check functions to incorporate the rank information of nominated samples in the estimation process. Also, we provide an alternative approach that translates estimation problems with nominated samples to corresponding problems under simple random sampling (SRS). Strategies are given to choose proper nomination sampling designs for a given population quantile. We apply our results to a large cohort study in Manitoba to analyze quantiles of bone mineral density using available covariates. We show that in some cases, methods based on nomination sampling designs require about one tenth of the sample used in SRS to estimate the lower or upper tail conditional quantiles with comparable mean squared errors. This is a dramatic reduction in time and cost compared with the usual SRS approach.
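As background, the standard check function used in quantile regression under SRS can be sketched as below. This is not the new check function proposed in the talk for nominated samples; it is the textbook version, shown for an intercept-only model, and the function names are hypothetical.

```python
def check_loss(u, tau):
    """Standard check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def sample_quantile(y, tau):
    """Minimise the summed check loss over the observed values (intercept-only model)."""
    return min(y, key=lambda q: sum(check_loss(v - q, tau) for v in y))


# Toy data (made up for illustration) with one extreme value in the upper tail.
y = [1, 2, 3, 4, 100]
median = sample_quantile(y, 0.5)
upper = sample_quantile(y, 0.9)
```

Minimising this loss with tau = 0.5 recovers the median, while values of tau near 0 or 1 target the tails, which is where a design that oversamples extremes would be expected to help.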
Analysis of Generalized Semiparametric Mixed Varying-Coefficient Effects Model for Longitudinal Data
The generalized semiparametric mixed varying-coefficient effects model for longitudinal data can flexibly model different types of covariate effects. Different link functions can be selected to provide a rich family of models for longitudinal data. The mixed varying-coefficient effects model accommodates constant effects, time-varying effects, and covariate-varying effects. The time-varying effects are unspecified functions of time and the covariate-varying effects are nonparametric functions of a possibly time-dependent exposure variable. We develop the semiparametric estimation procedure by using local linear smoothing and profile weighted least squares estimation techniques. The method requires smoothing in two different and yet connected domains for time and the time-dependent exposure variable. The estimators of the nonparametric effects are obtained through aggregations to improve efficiency. The asymptotic properties are investigated for the estimators of both nonparametric and parametric effects. Some hypothesis tests are developed to examine the covariate effects. The finite sample properties of the proposed estimators and tests are examined through simulations with satisfactory performances. The proposed methods are used to analyze the ACTG 244 clinical trial to investigate the effects of antiretroviral treatment switching in HIV infected patients before and after developing the codon 215 mutation.
New developments in survival forests techniques
Survival analysis answers the question of when an event of interest will happen. It studies time-to-event data where the true time is only observed for some subjects and others are censored. Right-censoring is the most common form of censoring in survival data. Tree-based methods are versatile and useful tools for analyzing survival data with right-censoring. Survival forests, which are ensembles of trees for time-to-event data, are powerful methods and are popular among practitioners. Current implementations of survival forests have some limitations. First, most of them use the log-rank test as the splitting rule, which loses power when the proportional hazards assumption is violated. Second, they work under the assumption that the event time and the censoring time are independent, given the covariates. Third, they do not provide dynamic predictions in the presence of time-varying covariates. We propose solutions to these limitations: We suggest the use of the integrated absolute difference between the two child nodes' survival functions as the splitting rule for settings where the proportionality assumption is violated. We propose two approaches to tackle the problem of dependent censoring with random forests. The first approach is to use a final estimate of the survival function that corrects for dependent censoring. The second one is to use a splitting rule which does not rely on the independent censoring assumption. Lastly, we make recommendations for different ways to obtain dynamic estimations of the hazard function with random forests with discrete-time survival data in the presence of time-varying covariates. In our current work, we are developing forests for clustered survival data.
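The integrated absolute difference between two child-node survival curves can be sketched as follows. This is a minimal illustration that assumes Kaplan-Meier estimates in each child node and an unweighted integral up to a fixed horizon; the exact weighting and estimator used in the talk are not given in the abstract, and all names are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (Kaplan-Meier estimator)."""
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


def surv_at(curve, t):
    """Value of the right-continuous step function at time t."""
    value = 1.0
    for jump, s in curve:
        if jump <= t:
            value = s
    return value


def integrated_abs_diff(left, right, horizon):
    """Integrate |S_left(t) - S_right(t)| over [0, horizon]."""
    km_l, km_r = kaplan_meier(*left), kaplan_meier(*right)
    cuts = {0.0, float(horizon)}
    cuts.update(t for t, _ in km_l + km_r if t < horizon)
    grid = sorted(cuts)
    # Both curves are constant between consecutive cut points.
    return sum(
        (b - a) * abs(surv_at(km_l, a) - surv_at(km_r, a))
        for a, b in zip(grid, grid[1:])
    )


# Toy child nodes (made up): (times, event indicators), 1 = event, 0 = censored.
left = ([1, 2, 3], [1, 1, 1])
right = ([2, 4, 6], [1, 1, 1])
score = integrated_abs_diff(left, right, horizon=6)
```

A candidate split with a larger score separates the two survival curves more, whether or not the hazards are proportional, since the criterion compares the curves directly rather than relying on the log-rank statistic.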
Statistics meets the protein folding problem: fast exploration of conformations with sequential Monte Carlo
The problem of predicting the 3-D structure of a protein from its amino acid sequence using computer algorithms has challenged scientists for nearly half a century. The structure of a protein is essential for understanding its function, and hence accurate structure prediction is of vital importance in modern applications such as protein design in biomedicine. A powerful approach for structure prediction is to search for the conformation of the protein that has minimum potential energy. However, due to the size of the conformational space, efficient exploration remains a bottleneck for energy-guided computational methods even with the aid of known structures in the Protein Data Bank. In this talk, I will first introduce this exploration problem from a statistical perspective. Then, I will present a new method for building segments of protein structures that is inspired by sequential Monte Carlo and enables faster exploration than existing methods. Finally, we apply the method to examples of real proteins and demonstrate its promise for improving the low confidence segments of 3-D structure predictions.
Modern Classification with Big Data
Rapid advances in information technologies have ushered in the era of "big data" and revolutionized scientific research. Big data creates golden opportunities but has also raised unprecedented challenges due to the massive size and complex structure of the data. Among many tasks in statistics and machine learning, classification has diverse applications, ranging from improving daily life to reaching the new frontiers of science and engineering. This talk will discuss a vision of broader approaches to modern classification methodologies, as well as computational considerations to cope with the big data challenges. I will present a modern classification method named data-driven generalized distance-weighted discrimination. A fast algorithm with an emphasis on computational efficiency for big data will be introduced. Our method is formulated in a reproducing kernel Hilbert space, and learning theory of the Bayes risk consistency will be developed. In addition, I will use extensive benchmark data applications to demonstrate that the prediction accuracy of our method is highly competitive with state-of-the-art classification methods including support vector machines, random forests, gradient boosting, and deep neural networks.
This seminar has been cancelled.
Inference for statistical interactions under misspecified or high-dimensional main effects
An increasing number of multi-omic studies have generated complex high-dimensional data. A primary focus of these studies is to determine whether exposures interact in the effect that they produce on an outcome of interest. Interaction is commonly assessed by fitting regression models in which the linear predictor includes the product between those exposures. When the main interest lies in interactions, the standard approach is not satisfactory because it is prone to (possibly severe) type I error inflation when the main exposure effects are misspecified or high-dimensional. I will propose generalized score type tests for high-dimensional interaction effects on correlated outcomes. I will also discuss the theoretical justification of some empirical observations regarding type I error control, and introduce solutions to achieve robust inference for statistical interactions. The proposed methods will be illustrated using an example from the Multi-Ethnic Study of Atherosclerosis (MESA), investigating interaction between measures of neighborhood environment and genetic regions on longitudinal measures of blood pressure over a study period of about seven years with four exams.
Parametric and Nonparametric Models for Higher-order Interactions
In this talk, I will discuss parametric and nonparametric models for higher-order interactions with a focus on the statistical and computational aspects. In fields like social, political and biological sciences, there is a clear need for analyzing higher-order interactions as opposed to pairwise interactions, which have been the main focus of statistical network analysis recently. Generalized Block Models and Hypergraphons are powerful tools for modeling higher-order interactions. The talk will introduce the models, present theoretical results highlighting the challenges and differences that arise when analyzing higher-order interactions compared to pairwise interactions, and discuss applications and numerical results.
Nonparametric Inference for Sensitivity of Haezendonck-Goovaerts Risk Measure
Recently, the Haezendonck-Goovaerts (H-G) risk measure has become popular in actuarial science. When it is applied to an insurance or a financial portfolio with several loss variables, sensitivity analysis becomes useful in managing the portfolio, and the assumption of independent observations may not be reasonable. This paper first derives an expression for computing the sensitivity of the H-G risk measure, which enables us to estimate the sensitivity nonparametrically via the H-G risk measure. Further, we derive the asymptotic distributions of the nonparametric estimators for the H-G risk measure and the sensitivity by assuming that loss variables in the portfolio follow a strictly stationary α-mixing sequence. A simulation study is provided to examine the finite sample performance of the proposed nonparametric estimators. Finally, the method is applied to a real data set. Key words and phrases: Asymptotic distribution, Haezendonck-Goovaerts risk measure, Mixing sequence, Nonparametric estimate, Sensitivity analysis
Competitive Equilibria in a Comonotone Market
The notion of competitive equilibria has been a crucial consideration in risk sharing problems. A large literature is devoted to analyses of optimal risk sharing based on expected utilities in a complete market. In this work, we investigate the competitive equilibria in a special type of incomplete market, referred to as a comonotone market, where agents can only trade such that their wealth allocation is comonotonic. The comonotone market is motivated by two seemingly unrelated observations. First, in a complete market, under mild conditions on the preferences, an equilibrium allocation is generally comonotonic. Second, in a standard insurance market, the allocation of risk among the insured, the insurer and the reinsurers is assumed to be comonotonic a priori to the risk-exchange. Two popular classes of preferences in risk management and behavioural economics, dual utilities (DU) and rank-dependent expected utilities (RDU), are used to formulate agents' objectives. We focus on establishing a pair of an equilibrium wealth allocation and an equilibrium pricing measure. For DU-comonotone markets, we find the equilibrium in closed-form. We further propose an algorithm to numerically obtain a competitive equilibrium based on discretization, which works for both the DU-comonotone market and the RDU-comonotone market. Results illustrate the intriguing and possibly puzzling fact that the equilibrium pricing kernel may not be counter-comonotone with the aggregate risk, in sharp contrast to the case of a complete market.
Community Estimation on Weighted Networks
Community identification in a network is an important problem in fields such as social science, neuroscience, and genetics. Over the past decade, stochastic block models (SBMs) have emerged as a popular statistical framework for this problem. However, SBMs have an important limitation in that they are suited only for networks with unweighted edges; disregarding the edge weights may result in a loss of valuable information in various scientific applications. We propose a weighted generalization of the SBM where we model the probability distribution of the edge weights as a mixture whose latent components reflect the latent community structure of the network. In this model, observations comprise a weighted adjacency matrix where the weight of each edge is generated independently from one of two unknown probability densities depending on whether the edge is within-community or between-community. We characterize the optimal rate of mis-clustering error of the weighted SBM in terms of the Rényi divergence between the probability distributions of within-community and between-community edges, substantially generalizing existing results for unweighted SBMs. Furthermore, we present a computationally tractable algorithm based on discretization that is adaptive to the unknown edge weight densities in the sense that it achieves the same optimal error rate as if it had perfect knowledge of the edge weight densities.
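To make the divergence concrete, here is a minimal sketch of a Rényi divergence between two discretized edge-weight distributions. The order 1/2 and the three-bin distributions are assumptions chosen for illustration; the abstract does not state the order or any particular discretization, and the function name is hypothetical.

```python
import math


def renyi_half(p, q):
    """Renyi divergence of order 1/2 between discrete distributions: -2 log sum sqrt(p_i q_i)."""
    return -2.0 * math.log(sum(math.sqrt(a * b) for a, b in zip(p, q)))


# Made-up binned weight distributions for within- and between-community edges.
within = [0.7, 0.2, 0.1]
between = [0.1, 0.2, 0.7]
gap = renyi_half(within, between)
```

The divergence is zero when the two distributions coincide and grows as they separate, matching the intuition that communities are easier to recover when within-community and between-community weights look more different.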
Testing the multivariate regular variation model for extreme risks
Heavy-tail phenomena generally exist in insurance, finance and economics. Multivariate regular variation (MRV) is one of the most important structures in modeling multivariate extreme risks with heavy-tailed marginal distributions and flexible dependence structures. In this paper, we propose a formal goodness-of-fit test for the MRV model. The test is based on comparing the tail indices of the radial component conditional on the angular component falling in different subsets. We first establish the estimator of the conditional tail index and prove the joint asymptotic property for all such estimators. We further combine the test on the constancy across different conditional tail indices with testing the regular variation of the radial component. Our proofs are based on the asymptotic properties of tail and non-tail empirical processes. Simulation studies demonstrate the good performance of the proposed tests, and real market data applications are also provided.
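A tail index is commonly estimated from the largest observations of a sample. The sketch below uses the Hill estimator as a familiar example. The abstract does not name the estimator behind the conditional tail indices, so this choice, the toy data, and the function name are assumptions.

```python
import math


def hill_estimator(sample, k):
    """Hill estimate of 1/alpha from the k largest values of a positive sample (k < n)."""
    ordered = sorted(sample)
    threshold = ordered[-k - 1]  # the (k+1)-th largest observation
    top = ordered[-k:]           # the k largest observations
    return sum(math.log(x / threshold) for x in top) / k


# Made-up radial components; in the setting of the talk one would presumably
# use only the radii whose angular component falls in a given subset.
radii = [1.0, math.e, math.e ** 2, math.e ** 3]
gamma_hat = hill_estimator(radii, k=2)  # estimate of 1/alpha
```

Comparing such estimates across subsets of the angular component mirrors the idea of testing whether the conditional tail index is constant.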
Sparse Estimation for Functional Semiparametric Additive Models
In the context of functional data analysis, functional linear regression serves as a fundamental tool to handle the relationship between a scalar response and a functional covariate. With the aid of Karhunen–Loève expansion of a stochastic process, a functional linear model can be written as an infinite linear combination of functional principal component scores. A reduced form is fitted in practice for dimension reduction; it is essentially converted to a multiple linear regression model.
Though the functional linear model is easy to implement and interpret in applications, it may suffer from an inadequate fit due to this specific linear representation. Additionally, effects of scalar predictors which may be predictive of the scalar response are neglected in the functional linear model.
Prediction accuracy can be enhanced greatly by incorporating effects of these scalar predictors.
In this talk, we propose a functional semiparametric additive model, which models the effect of a functional covariate nonparametrically and models several scalar covariates in a linear form. We develop the method for estimating the functional semiparametric additive model by smoothing and selecting non-vanishing components for the functional covariate. We show that the estimation method can consistently estimate both nonparametric and parametric parts in the model. Numerical studies will be presented to demonstrate the advantage of the proposed model in prediction.
Statistical Methods for the Analysis of Censored Family Data under Biased Sampling Schemes
Studies of the genetic basis for chronic disease often first aim to examine the nature and extent of within-family dependence in disease status. Families for such studies are typically selected using a biased sampling scheme in which affected individuals are recruited from a disease registry, followed by their consenting relatives. This gives right-censored or current status information on disease onset times. Methods for correcting this response-dependent sampling scheme have been developed for correlated binary data but variation in the age of assessment for family members makes this analysis uninterpretable. We develop likelihood and composite likelihood methods for modeling within-family associations in disease onset time using copula functions and second-order regression models in which dependencies are characterized by Kendall’s τ. Auxiliary data from an independent sample of individuals can be integrated by augmenting the composite likelihood to ensure identifiability and increase efficiency. An application to a motivating family study in psoriatic arthritis illustrates the method and provides evidence of excessive paternal transmission of risk. Ongoing work on the use of second-order estimating functions, alternative frameworks for dependence modeling, and approaches to efficient study design will also be discussed.