Department of Statistics and
Actuarial Science (SAS)
Mathematics 3 (M3)
University of Waterloo
Administrative Staff Directory
Phone: 519-888-4567, ext. 43550
Space-filling Designs for Computer Experiments and Their Application to Big Data Research
Computer experiments provide useful tools for investigating complex systems, and they call for space-filling designs, which are a class of designs that allow the use of various modeling methods. He and Tang (2013) introduced and studied a class of space-filling designs, strong orthogonal arrays. To date, an important problem that has not been addressed in the literature is that of design selection for such arrays. In this talk, I will first give a broad introduction to space-filling designs, and then present some results on the selection of strong orthogonal arrays.
The second part of my talk will present some preliminary work on the application of space-filling designs to big data research. Nowadays, it is challenging to use current computing resources to analyze super-large datasets. Subsampling-based methods are the common approaches to reducing data sizes, with the leveraging method (Ma and Sun, 2014) being the most popular. Recently, a new approach, the information-based optimal subdata selection (IBOSS) method, was proposed (Wang, Yang and Stufken, 2018), which applies design methodology to the big data problem. However, both the leveraging method and the IBOSS method are model-dependent. Space-filling designs do not suffer from this drawback, as shown in our simulation studies.
From Random Landscapes to Statistical Inference
Consider the problem of recovering a rank 1 tensor of order k that has been subject to additive Gaussian noise. It is information-theoretically possible to recover the tensor with a finite number of samples via maximum likelihood estimation; however, it is expected that one needs a polynomially diverging number of samples to recover it efficiently. What is the cause of this large statistical-to-algorithmic gap? To understand this interesting question of high-dimensional statistics, we begin by studying an intimately related question: optimization of random homogeneous polynomials on the sphere in high dimensions. We show that the estimation threshold is related to a geometric analogue of the BBP transition for matrices. We then study the threshold for efficient recovery for a simple class of algorithms, Langevin dynamics and gradient descent. We view this problem in terms of a broader class of polynomial optimization problems and propose a mechanism for success/failure of recovery in terms of the strength of the signal on the high entropy region of the initialization. We will review several results including joint works with Ben Arous-Gheissari and Lopatto-Miolane.
How does consumption habit affect the household’s demand for life-contingent claims?
This paper examines the impact of habit formation on demand for life-contingent claims. We propose a life-cycle model with habit formation and solve the optimal consumption, portfolio choice, and life insurance/annuity problem analytically. We illustrate how consumption habits can alter the bequest motive and therefore drive the demand for life-contingent products. Finally, we use our model to examine the mismatch in the life insurance market between the life insurance holdings of most households and their underlying financial vulnerabilities, and the mismatch in the annuity market between the lack of any annuitization and the risk of outliving financial wealth.
If Journals Embraced Conditional Equivalence Testing, Would Research be Better?
Motivated by recent concerns with the reproducibility and reliability of scientific research, we introduce a publication policy that incorporates "conditional equivalence testing" (CET), a two-stage testing scheme in which standard null hypothesis significance testing (NHST) is followed conditionally by testing for equivalence. We explain how such a policy could address issues of publication bias, and investigate similarities with a Bayesian approach. We then develop a novel optimality model that, given current incentives to publish, predicts a researcher's most rational use of resources. Using this model, we are able to determine whether a given policy, such as our CET policy, can incentivize more reliable and reproducible research.
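The two-stage structure of CET described above can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not taken from the abstract: it uses z-tests with a known standard error, a single significance level `alpha` for both stages, and a hypothetical equivalence margin `margin`; the function name and the three outcome labels are likewise invented for the example.

```python
from statistics import NormalDist


def conditional_equivalence_test(effect, std_err, margin, alpha=0.05):
    """Two-stage CET sketch using z-tests (assumed, not the authors' exact tests).

    Returns "positive" if the effect is significantly different from zero,
    "negative" if it is significantly inside (-margin, margin),
    and "inconclusive" otherwise.
    """
    norm = NormalDist()

    # Stage 1: standard two-sided NHST of H0: effect == 0.
    z = effect / std_err
    p_nhst = 2 * (1 - norm.cdf(abs(z)))
    if p_nhst < alpha:
        return "positive"

    # Stage 2 (run only when stage 1 is not significant): two one-sided
    # tests (TOST) of H0: |effect| >= margin.
    p_lower = 1 - norm.cdf((effect + margin) / std_err)  # H0: effect <= -margin
    p_upper = norm.cdf((effect - margin) / std_err)      # H0: effect >= +margin
    p_equiv = max(p_lower, p_upper)
    if p_equiv < alpha:
        return "negative"
    return "inconclusive"


if __name__ == "__main__":
    print(conditional_equivalence_test(0.50, 0.10, margin=0.2))
    print(conditional_equivalence_test(0.00, 0.05, margin=0.2))
    print(conditional_equivalence_test(0.10, 0.10, margin=0.2))
```

The point of the second stage is that a non-significant result is no longer automatically uninformative: a precise estimate near zero can be reported as evidence of equivalence, while an imprecise one remains inconclusive.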
Asymptotically optimal multiple testing with streaming data
The problem of testing multiple hypotheses with streaming (sequential) data arises in diverse applications such as multi-channel signal processing, surveillance systems, multi-endpoint clinical trials, and online surveys. In this talk, we investigate the problem under two generalized error metrics. Under the first one, the probability of at least k mistakes, of any kind, is controlled. Under the second, the probabilities of at least k1 false positives and at least k2 false negatives are simultaneously controlled. For each formulation, we characterize the optimal expected sample size to a first-order asymptotic approximation as the error probabilities vanish, and propose a novel procedure that is asymptotically efficient under every signal configuration. These results are established when the data streams for the various hypotheses are independent and each local log-likelihood ratio statistic satisfies a certain law of large numbers. Further, in the special case of iid observations, we quantify the asymptotic gains of sequential sampling over fixed-sample size schemes.
The Cost of Privacy: Optimal Rates of Convergence for Parameter Estimation with Differential Privacy
With the unprecedented availability of datasets containing personal information, there are increasing concerns that statistical analysis of such datasets may compromise individual privacy. These concerns give rise to statistical methods that provide privacy guarantees at the cost of some statistical accuracy. A fundamental question is: to satisfy a certain desired level of privacy, what is the best statistical accuracy one can achieve? Standard statistical methods fail to yield sharp results, and new technical tools are called for.
In this talk, I will present a general lower bound argument to investigate the tradeoff between statistical accuracy and privacy, with application to three problems: mean estimation, linear regression and classification, in both the classical low-dimensional and modern high-dimensional settings. For these statistical problems, we also design computationally efficient algorithms that match the minimax lower bound under the privacy constraints. Finally, I will show the applications of those privacy-preserving algorithms to real data containing sensitive information, such as SNPs and body fat, for which privacy-preserving statistical methods are necessary.
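To make the accuracy/privacy tradeoff concrete for the simplest of the three problems, mean estimation, here is a minimal sketch of the textbook Laplace mechanism. It is not the algorithm from the talk and makes no claim of minimax optimality; the clipping bounds `lower` and `upper`, the privacy parameter `epsilon`, and the function name are all assumptions chosen for illustration.

```python
import random


def private_mean(values, lower, upper, epsilon, rng=None):
    """Epsilon-differentially-private mean via clipping plus Laplace noise.

    Assumes the sample size is public and neighbouring datasets differ by
    replacing one record.
    """
    rng = rng or random.Random()
    n = len(values)

    # Clip each record so that one individual has bounded influence.
    clipped = [min(max(v, lower), upper) for v in values]

    # Replacing one clipped record moves the mean by at most this much.
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon

    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace(0, scale) distributed.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clipped) / n + noise


if __name__ == "__main__":
    data = [22.1, 18.4, 30.2, 25.0, 27.3]
    for eps in (0.1, 1.0, 10.0):
        print(eps, private_mean(data, 0.0, 50.0, eps, random.Random(0)))
```

The noise scale is proportional to 1 / (n * epsilon), so a stronger privacy requirement (smaller `epsilon`) or a smaller sample directly costs accuracy, which is the kind of tradeoff the lower bounds in the talk quantify.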
Some Priors for Nonparametric Shrinkage and Bayesian Sparsity Inference
In this talk, I introduce two novel classes of shrinkage priors for different purposes: the functional HorseShoe (fHS) prior for nonparametric subspace shrinkage and neuronized priors for general sparsity inference.
In function estimation problems, the fHS prior encourages shrinkage towards parametric classes of functions. Unlike other shrinkage priors for parametric models, the fHS shrinkage acts on the shape of the function rather than inducing sparsity on model parameters. I study some desirable theoretical properties including an optimal posterior concentration property on the function and the model selection consistency. I apply the fHS prior to nonparametric additive models for some simulated and real data sets, and the results show that the proposed procedure outperforms the state-of-the-art methods in terms of estimation and model selection.
For general sparsity inference, I propose the neuronized priors to unify and extend existing shrinkage priors such as one-group continuous shrinkage priors, continuous spike-and-slab priors, and discrete spike-and-slab priors with point-mass mixtures. The new priors are formulated as the product of a weight variable and a transformed scale variable via an activation function. By altering the activation function, practitioners can easily implement a large class of Bayesian variable selection procedures. Compared with classic spike and slab priors, the neuronized priors achieve the same explicit variable selection without employing any latent indicator variable, which results in more efficient MCMC algorithms and more effective posterior modal estimates. I also show that these new formulations can be applied to more general and complex sparsity inference problems, which are computationally challenging, such as structured sparsity and spatially correlated sparsity problems.
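The "product of a weight variable and a transformed scale variable" construction can be illustrated by sampling from such a prior. The sketch below assumes a specific form, theta = T(alpha - alpha0) * w with alpha standard normal and w normal with standard deviation `tau_w`; these distributional choices, the shift `alpha0`, and all names are assumptions made for the example and may differ from the talk's exact formulation.

```python
import random


def relu(t):
    """ReLU activation: exact zeros for non-positive inputs."""
    return max(0.0, t)


def identity(t):
    """Identity activation: gives a continuous, product-of-normals prior."""
    return t


def sample_neuronized_prior(n_draws, alpha0=0.0, tau_w=1.0,
                            activation=relu, seed=0):
    """Draw theta = activation(alpha - alpha0) * w (assumed form)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        alpha = rng.gauss(0.0, 1.0)   # scale variable
        w = rng.gauss(0.0, tau_w)     # weight variable
        draws.append(activation(alpha - alpha0) * w)
    return draws


if __name__ == "__main__":
    sparse = sample_neuronized_prior(10_000, activation=relu)
    dense = sample_neuronized_prior(10_000, activation=identity)
    print("share of exact zeros (ReLU):    ", sum(d == 0 for d in sparse) / len(sparse))
    print("share of exact zeros (identity):", sum(d == 0 for d in dense) / len(dense))
```

With the ReLU activation a draw is exactly zero whenever alpha falls below `alpha0`, which mimics a discrete spike-and-slab prior without a separate indicator variable; swapping the activation changes the type of shrinkage while the sampling code stays the same.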
Strategies for scaling iterated conditional Sequential Monte Carlo methods for high dimensional state space models
The iterated Conditional Sequential Monte Carlo (cSMC) method is a particle MCMC method commonly used for state inference in non-linear, non-Gaussian state space models. Standard implementations of iterated cSMC provide an efficient way to sample state sequences in low-dimensional state space models. However, efficiently scaling iterated cSMC methods to perform well in models with a high-dimensional state remains a challenge. One reason for this is the use of a global proposal, without reference to the current state sequence in the MCMC run. In high dimensions, such a proposal will typically not be well-matched to the posterior and will impede efficient sampling. I will describe a technique based on the embedded HMM (Hidden Markov Model) framework to construct efficient proposals in high dimensions that are local relative to the current state sequence. A second obstacle to scalability of iterated cSMC is not using the entire observed sequence to construct the proposal. Typical implementations of iterated cSMC use a proposal at time t that relies only on data up to time t. In high dimensions and in the presence of informative data, such proposals become inefficient, and can considerably slow down sampling. I will introduce a principled approach to incorporating future observations in the cSMC proposal at time t. By considering several examples, I will demonstrate that both strategies improve the performance of iterated cSMC for sequence sampling in high-dimensional state space models.
Impact of preferences on optimal insurance in the presence of multiple policyholders
In the optimal insurance literature, one typically studies optimal risk sharing between one insurer (or reinsurer) and one policyholder. However, the insurance business is based on diversification benefits that arise when pooling many insurance policies. In this paper, we first show that results on optimal insurance that are valid in the case of a single policyholder extend to the case of multiple policyholders, provided their insurance claims are independent. However, due to natural catastrophes, increasing life expectancy and terrorism events, insurance claims show a tendency to be correlated. Interestingly, in the case of interdependent insurance policies, it may become optimal for the insurer to refuse to sell insurance to some prospects, based on their attitude towards risk or due to their risk exposure characteristics. This finding calls for government policies to ensure that insurance stays available and affordable to everyone.
Bayesian nonparametric models for compositional data
We propose Bayesian nonparametric procedures for density estimation for compositional data, i.e., data in the simplex space. To this aim, we propose prior distributions on probability measures based on modified classes of multivariate Bernstein polynomials. The resulting prior distributions are induced by mixtures of Dirichlet distributions, with random weights and a random number of components. Theoretical properties of the proposal are discussed, including large support and consistency of the posterior distribution. We use the proposed procedures to define latent models and apply them to data on employees of the U.S. federal government. Specifically, we model data comprising federal employees’ careers, i.e., the sequence of agencies where the employees have worked. Our modeling of the employees’ careers is part of a broader undertaking to create a synthetic dataset of the federal workforce. The synthetic dataset will facilitate access to relevant data for social science research while protecting subjects’ confidential information.
Learning Optimal Individualized Decision Rules with Risk Control
With the emergence of precision medicine, estimation of optimal individualized decision rules (IDRs) has attracted tremendous attention in many scientific areas. Most existing literature has focused on finding optimal IDRs that can maximize the expected outcome for each individual. Motivated by complex individualized decision making procedures and the popular conditional value at risk, in this talk, I will introduce two new robust criteria to evaluate IDRs: one is focused on the average lower tail of the subjects’ outcomes and the other is on the individualized lower tail of each subject’s outcome. The proposed criteria take tail behaviors of the outcome into consideration, and thus the resulting optimal IDRs are robust in controlling adverse events. The optimal IDRs under our criteria can be interpreted as the distributionally robust decision rules that maximize the “worst-case” scenario of the outcome within a probability constrained set. Simulation studies and a real data application are used to demonstrate the robust performance of our methods. Finally, I will introduce a more general decision-rule based optimized covariates dependent equivalent framework for individualized decision making with risk control.
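The idea of judging a decision rule by the lower tail of its outcomes rather than by their mean can be shown with an empirical, conditional-value-at-risk style summary. This is a generic sketch, not the criteria defined in the talk: the tail level `tau`, the function name, and the two toy outcome samples are invented for illustration, and larger outcomes are assumed to be better.

```python
import math


def lower_tail_average(outcomes, tau):
    """Average of the worst ceil(tau * n) outcomes (larger outcome = better)."""
    if not 0 < tau <= 1:
        raise ValueError("tau must lie in (0, 1]")
    k = max(1, math.ceil(tau * len(outcomes)))
    worst = sorted(outcomes)[:k]
    return sum(worst) / k


if __name__ == "__main__":
    # Hypothetical outcomes under two decision rules.
    rule_a = [9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, -10.0, -10.0]
    rule_b = [4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0]

    for name, y in (("A", rule_a), ("B", rule_b)):
        mean = sum(y) / len(y)
        tail = lower_tail_average(y, tau=0.2)
        print(f"rule {name}: mean={mean:.2f}, lower-tail average={tail:.2f}")
```

In this toy comparison rule A has the higher mean but a much worse lower tail, so a mean-based criterion and a tail-based criterion would rank the two rules differently.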
Estimation in Preferential Attachment Networks
Preferential attachment is widely used to model power-law behavior of degree distributions in both directed and undirected networks. Statistical estimates of the tail exponent of the power-law degree distribution often use the Hill estimator as one of the key summary statistics, even though the consistency of the Hill estimator for network data has not been explored. We derive the asymptotic behavior of the joint degree sequences by embedding the in- and out-degrees of a fixed node into a pair of switched birth processes with immigration and then establish the convergence of the joint tail empirical measure. From these steps, the consistency of the Hill estimators is obtained.
Meanwhile, one important practical issue of the tail estimation problem is how to select a threshold above which observations follow a power-law distribution. A minimum distance selection procedure (MDSP) has been widely adopted, especially in the analyses of social networks. However, theoretical justifications on this selection procedure remain scant. We then study the asymptotic behavior of the optimal threshold and the corresponding power-law index given by the MDSP. We also find that the MDSP tends to choose too high a threshold level and leads to Hill estimates with large variances and root mean squared errors for simulated data with Pareto-like tails.
Note: This is based on joint works with S.I. Resnick (Cornell University, US), H. Drees (University of Hamburg, Germany) and A. Janßen (KTH Royal Institute of Technology, Sweden).
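For readers unfamiliar with the Hill estimator referred to in this abstract, the following sketch computes it on simulated i.i.d. Pareto data. It covers only the classical estimator for a given number of upper order statistics `k`; it does not implement the MDSP threshold selection or any network model, and the sample size, seed, and values of `k` are arbitrary choices for the example.

```python
import math
import random


def hill_estimator(data, k):
    """Hill estimate of the tail index alpha from the k largest observations.

    Uses the (k+1)-th largest value as the threshold; data must be positive.
    """
    x = sorted(data, reverse=True)
    if not 0 < k < len(x):
        raise ValueError("k must satisfy 0 < k < len(data)")
    threshold = x[k]
    mean_log_excess = sum(math.log(x[i] / threshold) for i in range(k)) / k
    return 1.0 / mean_log_excess


if __name__ == "__main__":
    rng = random.Random(1)
    true_alpha = 2.0
    sample = [rng.paretovariate(true_alpha) for _ in range(20_000)]
    for k in (50, 500, 2000):
        print(k, round(hill_estimator(sample, k), 3))
```

Choosing `k` is the same decision as choosing the threshold: a very high threshold (small `k`) leaves few observations and a noisy estimate, which is the variance issue the abstract attributes to the MDSP.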
Nonparametric Overdose Control in Phase I Dose-Finding Clinical Trials
The primary objective of phase I oncology trials is to assess the safety of the new drug. Under the framework of Bayesian model selection, we propose a nonparametric overdose control (NOC) design for dose finding in phase I clinical trials. Each dose assignment is guided via a feasibility bound, which thereby can control the number of patients allocated to excessively toxic dose levels. We further develop a fractional NOC (fNOC) design in conjunction with a so-called fractional imputation approach, to account for late-onset toxicity outcomes. Extensive simulation studies have been conducted to show that both the NOC and fNOC designs have robust and satisfactory finite-sample performance compared with the existing dose-finding designs. The proposed methods also possess several desirable properties: treating patients more safely and also neutralizing the aggressive escalation to overly toxic doses when the toxicity outcomes are late-onset. We also generalize the NOC design to handle drug-combination trials and phase I/II trials.
Efficient Bayesian Approaches for Big Data and Complex Models
Bayesian inference methods are essential for modern data analysis. Ever growing datasets and model complexities, however, pose major challenges to classical approaches and have motivated many advancements in scalable inference methods and efficient methods that can handle complicated model structures. In this talk, I will describe some recent work on scalable Bayesian inference methods and efficient learning algorithms for complex models, with applications in machine learning and computational biology. By exploiting the regularity of the underlying probabilistic models, I propose an alternative scalable MCMC approach without sacrificing the exploration efficiency, overcoming a potential drawback of classical stochastic gradient MCMC methods. We extend a state-of-the-art MCMC algorithm, Hamiltonian Monte Carlo, to models with both continuous and discrete (structured) parameters, and successfully apply it to Bayesian phylogenetic inference, an important discipline of evolutionary biology that focuses on the reconstruction of the tree of life. Moreover, I propose a novel graphical model, subsplit Bayesian networks (SBNs), that can provide flexible distributions on phylogenetic trees. The flexibility of SBNs not only allows efficient tree probability estimators, but also enables a general variational framework for Bayesian phylogenetic inference that has promising speed and scalability compared to the current random-walk MCMC approaches.
Multivariate Discrete Outcomes: from Marginal Model Diagnostics to Nonparametric Copula Estimation
Multivariate discrete outcomes are common in a wide range of areas including insurance. When the interplay between outcomes is significant, quantifying dependencies among interrelated variables is of great importance. Due to their ability to flexibly accommodate dependence, copulas have been utilized extensively for dependence modeling in insurance. Yet the application of copulas to discrete data is still in its infancy. Although a substantial literature has emerged focusing on copula models under continuity, some key steps and concepts do not carry over to discrete data. One major barrier is the non-uniqueness of copulas, calling into question model interpretations and predictions. We study the issue of identifiability in a regression context and establish the conditions under which copula regression models are identifiable for discrete outcomes. Given uniqueness, we propose a nonparametric estimator of copulas to identify the "hidden" dependence structure for discrete outcomes and develop its asymptotic properties. We explore the finite sample performance of our estimator under different scenarios using extensive simulation studies, and use our model to investigate the dependence of insurance claim frequencies across different business lines using a dataset from the Local Government Property Insurance Fund in the state of Wisconsin.
Beyond copula modeling, we extend some of the key concepts underlying our copula estimator for the purpose of regression diagnostics. Making informed decisions about model adequacy has long been an outstanding issue for discrete outcomes. To fill this gap, we develop an effective diagnostic tool for univariate regression models for discrete outcomes and show that it outperforms Pearson and deviance residuals for various diagnostic tasks.