Thursday, December 7, 2017 — 4:00 PM EST

Statistical methods for pooled biomarker data

For many health outcomes, it has become increasingly common to aggregate data from multiple studies to obtain increased sample sizes. The enhanced sample size of the pooled data allows investigators to perform subgroup analyses, evaluate the dose-response relationship over a broad range of exposures, and provide robust estimates of the biomarker-disease association. However, study-specific calibration processes must be incorporated in the statistical analyses to address between-study variability in the biomarker measurements. We introduce methods for evaluating the biomarker-disease relationship that validly account for the calibration process. We consider both internal and external calibration studies in the context of nested and unmatched case-control studies. We then illustrate the utility of these estimators using simulations and an application to a circulating vitamin D and colorectal cancer pooling project.

Friday, December 1, 2017 — 2:00 PM EST

Price Dynamics in a General Markovian Limit Order Book

We propose a simple stochastic model for the dynamics of a limit order book, extending the recent work of Cont and de Larrard, where the price dynamics are endogenous, resulting from market transactions.  We also show that the diffusion limit of the price process is the so-called Brownian meander.

Friday, December 1, 2017 — 12:00 PM EST

Friday, November 24, 2017 — 2:00 PM EST

Comonotonic risk measures in a world without risk-free assets

We focus on comonotonic risk measures from the point of view of the primitives of the theory as initially laid down by Artzner et al. (1999): acceptance sets and eligible assets. We show that comonotonicity cannot be characterized by the properties of the acceptance set alone and heavily depends on the choice of the eligible asset. In fact, in many important cases, comonotonicity is only compatible with risk-free eligible assets. These findings seem to question the assumption of comonotonicity in a world of “discounted” capital positions and call for a renewed assessment of the meaning and the role of comonotonicity within a capital adequacy framework. Time permitting, we will also discuss some implications for insurance pricing.

Wednesday, November 22, 2017 — 9:00 AM EST

Integrative Directed Cyclic Graphical Models with Heterogeneous Samples

In this talk, I will introduce novel hierarchical directed cyclic graphical models to infer gene networks by integrating genomic data across platforms and across diseases. The proposed model takes into account tumor heterogeneity. In the case of data that can be naturally divided into known groups, we propose to connect graphs by introducing a hierarchical prior across group-specific graphs, including a correlation on edge strengths across graphs. Thresholding priors are applied to induce sparsity of the estimated networks. In the case of unknown groups, we cluster subjects into subpopulations and jointly estimate cluster-specific gene networks, again using similar hierarchical priors across clusters. Two applications with multiplatform genomic data for multiple cancers will be presented to illustrate the utility of our model. I will also briefly discuss my other work and future directions.     

Monday, November 20, 2017 — 9:00 AM EST

Statistical tools for analyzing network-linked data

While classic statistical tools such as regression and graphical models have been well studied, they are no longer applicable when the observations are connected by a network, an increasingly common situation in modern complex datasets. We develop the analogue of loss-based prediction models and graphical models for such network-linked data by introducing a network-based penalty that can be combined with any number of existing techniques. We show, both empirically and theoretically, that incorporating network information improves performance on a variety of tasks under the assumption of network cohesion, the empirically observed phenomenon of linked nodes acting similarly. Computationally efficient algorithms are also developed for implementing our proposal.

We also consider the general question of how to perform cross-validation and bootstrapping on networks, a long-standing open problem in network analysis. Model selection and tuning for many tasks can be performed through cross-validation, but splitting network data is non-trivial, since removing links leads to a potential change in network structure. We propose a new general cross-validation strategy for networks, based on repeatedly removing edge values at random and then applying matrix completion to reconstruct the full network. We obtain theoretical guarantees for this method under a low rank assumption on the underlying edge probability matrix, and show that the method is computationally efficient and performs well for a wide range of network tasks, in contrast to previously developed approaches that only apply under specific models.

Several real-world examples will be discussed throughout the talk, including the effect of friendship networks on adolescent marijuana usage, phrases that can be learned with the help of a collaboration network of statisticians, as well as statistician communities extracted from a citation network.

Friday, November 17, 2017 — 9:00 AM EST

A Group-Specific Recommender System

In recent years, there has been a growing demand to develop efficient recommender systems which track users’ preferences and recommend potential items of interest to users. In this article, we propose a group-specific method to use dependency information from users and items which share similar characteristics under the singular value decomposition framework. The new approach is effective for the “cold-start” problem, where, in the testing set, the majority of responses are obtained from new users or for new items, and their preference information is not available from the training set. One advantage of the proposed model is that we are able to incorporate information from the missing mechanism and group-specific features through clustering based on the numbers of ratings from each user and other variables associated with missing patterns. In addition, since this type of data involves large-scale customer records, traditional algorithms are not computationally scalable. To implement the proposed method, we propose a new algorithm that embeds a back-fitting algorithm into alternating least squares, which avoids large matrix operations and heavy memory storage, and therefore makes scalable computing feasible. Our simulation studies and MovieLens data analysis both indicate that the proposed group-specific method improves prediction accuracy significantly compared to existing competitive recommender system approaches.

Wednesday, November 15, 2017 — 9:00 AM EST

Causal inference in observational data with unmeasured confounding

Observational data introduce many practical challenges for causal inference. In this talk, I will focus on a particular issue that arises when there are unobserved confounders such that the assumption of “ignorability” is violated. For making a causal inference in the presence of unmeasured confounders, instrumental variable (IV) analysis plays a crucial role. I will introduce a hierarchical Bayesian likelihood-based IV analysis under a Latent Index Modeling framework to jointly model outcomes and treatment status, along with the necessary assumptions and sensitivity analysis to make a valid causal inference. The innovation in our methodology is an extension of the existing parametric approach by (i) accounting for unobserved heterogeneity via a latent factor structure, and (ii) allowing non-parametric error distributions with Dirichlet process mixture models. We demonstrate the utility of our model in comparing the effectiveness of two different types of vascular access for a cardiovascular procedure.

Monday, November 13, 2017 — 9:00 AM EST

A log-linear time algorithm for constrained changepoint detection

Changepoint detection is a central problem in time series and genomic data. For some applications, it is natural to impose constraints on the directions of changes. One example is ChIP-seq data, for which adding an up-down constraint improves peak detection accuracy, but makes the optimization problem more complicated. In this talk I will explain how a recently proposed functional pruning algorithm can be generalized to solve such constrained changepoint detection problems. Our proposed log-linear time algorithm achieves state-of-the-art peak detection accuracy in a benchmark of several genomic data sets, and is orders of magnitude faster than our previous quadratic time algorithm. Our implementation is available as the PeakSegPDPA function in the PeakSegOptimal R package.

Friday, November 10, 2017 — 9:00 AM EST

Latent variable modeling: from functional data analysis to cancer genomics

Many important research questions can be answered by incorporating latent variables into the data analysis.  However, this type of modelling requires the development of sophisticated methods and often computational tricks in order to make the inference problem more tractable. In this talk I present an overview of latent variable modelling and show how I have developed different latent variable techniques for several data analyses, two in functional data analysis and one in cancer genomics.

Thursday, November 9, 2017 — 1:30 PM EST

Asymptotic analysis of massive high-dimensional datasets

In the past twenty years the field of statistics has faced relatively new types of datasets in which both the number of observations $n$ and the number of predictors $p$ are very large.

Unfortunately, one of the most powerful analysis tools, i.e. the large $n$ asymptotic, that had led to many interesting discoveries in statistics in the past few centuries seems to be irrelevant for such high-dimensional problems. As a result, (i) the maximum likelihood estimator is not necessarily an optimal estimator in high-dimensional problems, and (ii) sharp analysis of estimators has become almost impossible, and hence no fair comparison exists among different estimators.

These issues call for a new asymptotic platform that is relevant to high-dimensional settings. In this talk, I will discuss an asymptotic framework in which both $n$ and $p$ go to infinity simultaneously. This asymptotic platform has the following main features: (i) It can be considered as a generalization of the classical asymptotic framework. In other words, it can generate the results of the classical asymptotics as a special case. (ii) It is capable of capturing some peculiar features of high-dimensional problems that were not present in low-dimensional datasets, such as phase transitions. I will use this asymptotic framework to study two classical problems of statistics, namely linear regression and variable selection. I will describe the progress we have made in the past few years on these problems, and also some of the remaining open questions.

Tuesday, November 7, 2017 — 4:00 PM EST

Pricing Bounds and Bang-bang Analysis of the Polaris Variable Annuities

In this talk, I will discuss the no-arbitrage pricing of the “Polaris Income Plus Daily” rider structured in the “Polaris Choice IV” variable annuities recently issued by the American International Group. Distinguished from most withdrawal benefits in the literature, Polaris allows the income base to “lock in” the high water mark of the investment account over a certain monitoring period, which is related to the timing of the policyholder’s first withdrawal. By prudently introducing certain auxiliary state and control variables, we manage to formulate the pricing model under a Markovian stochastic optimal control framework. For the rider charge proportional to the investment account, we establish a bang-bang solution for the optimal withdrawal strategies and show that they can only be among a few explicit choices. We consequently design a novel Least Square Monte Carlo (LSMC) algorithm for the optimal solution. Convergence results are established for the algorithm by applying the theory of nonparametric sieve estimation. Finally, we formally prove that the pricing results obtained under the rider charge proportional to the investment account work as an upper bound for a contract with insurance fees charged on the income base instead. Numerical studies show the superior performance of the pricing bounds. This talk is based on joint work with Prof. Chengguo Weng at the University of Waterloo.

Friday, November 3, 2017 — 9:00 AM EDT

Detecting Change in Dynamic Networks

Dynamic networks are often used to model the communications, interactions, or relational structure, of a group of individuals through time. In many applications, it is of interest to identify instances or periods of unusual levels of interaction among these individuals. The real-time monitoring of networks for anomalous changes is known as network surveillance.

This talk will provide an overview of the network surveillance problem and propose a network monitoring strategy that applies statistical process monitoring techniques to the estimated parameters of a degree corrected stochastic block model to identify significant structural change. For illustration, the proposed methodology will be applied to a dynamic U.S. Senate co-voting network as well as the Enron email exchange network. Several ongoing and open research problems will also be discussed.

Wednesday, November 1, 2017 — 9:00 AM EDT

A new framework of calibration for computer models: parameterization and efficient estimation

In this talk I will show some theoretical advances on the problem of calibration for computer models. The goal of calibration is to identify the model parameters in deterministic computer experiments, which cannot be measured or are not available in physical experiments. A theoretical framework is given which enables the study of parameter identifiability and estimation. In a study of the prevailing Bayesian method proposed by Kennedy and O’Hagan (2001), Tuo-Wu (2015, 2016) and Tuo-Wang-Wu (2017) find that this method may render unreasonable estimation for the calibration parameters. A novel calibration method, called L2 calibration, is proposed and proven to enjoy nice asymptotic properties, including asymptotic normality and semi-parametric efficiency. Inspired by a new advance in Gaussian process modeling, called orthogonal Gaussian process models (Plumlee and Joseph, 2016, Plumlee 2016), I have proposed another methodology for calibration. This new method is proven to be semi-parametric efficient, and in addition it allows for a simple Bayesian version so that Bayesian uncertainty quantification can be carried out computationally. In some sense, this latest work provides a complete solution to a long-standing problem in uncertainty quantification (UQ).

Tuesday, October 31, 2017 — 1:00 PM EDT

Data Adaptive Support Vector Machine with Application to Prostate Cancer Imaging Data

Support vector machines (SVM) have been widely used as classifiers in various settings including pattern recognition, texture mining and image retrieval. However, such methods face newly emerging challenges such as imbalanced observations and noisy data. In this talk, I will discuss the impact of noisy data and imbalanced observations on SVM classification and present a new data adaptive SVM classification method.

This work is motivated by a prostate cancer imaging study conducted in London Health Science Center. A primary objective of this study is to improve prostate cancer diagnosis and thereby to guide the treatment based on statistical predictive models. The prostate imaging data, however, are quite imbalanced in that the majority of voxels are cancer-free while only a very small portion of voxels are cancerous. This issue typically makes the available SVM classifiers skew toward one class and thus generate invalid results. Our proposed SVM method uses a data adaptive kernel to reflect the imbalance in the observations; the proposed method takes into consideration the location of support vectors in the feature space and thereby generates more accurate classification results. The performance of the proposed method is compared with existing methods using numerical studies.

Monday, October 30, 2017 — 4:00 PM EDT

Analysis of Clinical Trials with Multiple Outcomes

In order to obtain better overall knowledge of a treatment effect, investigators in clinical trials often collect many medically related outcomes, which are commonly called endpoints. It is fundamental to understand the objectives of a particular analysis before applying any adjustment for multiplicity. For example, multiplicity does not always lead to error rate inflation, or multiplicity may be introduced for purposes other than making an efficacy or safety claim, such as in sensitivity assessments. Sometimes, the multiple endpoints in clinical trials can be hierarchically ordered and logically related. In this talk, we will discuss the methods to analyze multiple outcomes in clinical trials with different objectives: the all-or-none approach, the global approach, the composite endpoint, and the at-least-one approach.

Thursday, October 26, 2017 — 4:00 PM EDT

Estimation of the expected shortfall given an extreme component under conditional extreme value model

For two risks, $X$ and $Y$, the Marginal Expected Shortfall (MES) is defined as $E[Y \mid X > x]$, where $x$ is large. MES is an important factor when measuring the systemic risk of financial institutions. In this talk we will discuss consistency and asymptotic normality of an estimator of MES, assuming that $(X, Y)$ follows a Conditional Extreme Value (CEV) model. The theoretical findings are supported by simulation studies. Our procedure is applied to some financial data. This is joint work with Kevin Tong (Bank of Montreal).
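
To make the definition $E[Y \mid X > x]$ concrete, here is a minimal Python sketch of the naive empirical version: the average of the observed $Y$ values among observations whose $X$ exceeds the threshold. This is only an illustration of the quantity being estimated, not the CEV-based estimator discussed in the talk; the function name `empirical_mes` and the toy data are assumptions introduced here.

```python
def empirical_mes(xs, ys, threshold):
    """Naive empirical E[Y | X > threshold]: mean of Y over the tail of X.

    Illustrative only; this is NOT the CEV-based estimator from the talk.
    """
    tail = [y for x, y in zip(xs, ys) if x > threshold]
    if not tail:
        # No data beyond the threshold: the naive estimator is undefined here.
        raise ValueError("no observations exceed the threshold")
    return sum(tail) / len(tail)


# Toy data (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0, 20.0, 30.0, 40.0, 50.0]
mes = empirical_mes(xs, ys, threshold=3.0)  # averages the Y values 40.0 and 50.0
```

The sketch also shows why a model is useful for large $x$: once the threshold exceeds every observed $X$, the empirical average has nothing to average over.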

Thursday, October 12, 2017 — 4:00 PM EDT

Optimal Insurance: Belief Heterogeneity, Ambiguity, and Arrow's Theorem

In Arrow's classical problem of demand for insurance indemnity schedules, it is well-known that the optimal insurance indemnification for an insurance buyer (the insured) is a straight deductible contract, when the insurer is a risk-neutral Expected Utility (EU) maximizer and when the insured is a risk-averse EU maximizer. In Arrow's framework, however, the two parties share the same probabilistic beliefs about the realizations of the underlying insurable loss, and neither party experiences ambiguity (Knightian uncertainty) about the distribution of this random loss. In this talk, I will discuss extensions of Arrow's classical result to situations of belief heterogeneity and ambiguity.

Thursday, October 5, 2017 — 4:00 PM EDT

Statistically and Numerically Efficient Independence Test

We study how to generate a statistical inference procedure that is both computationally efficient and has a theoretical guarantee on its statistical performance. Tests of independence play a fundamental role in many statistical techniques. Among the nonparametric approaches, the distance-based methods (such as distance correlation based hypothesis testing for independence) have numerous advantages compared with many other alternatives. A known limitation of the distance-based methods is that their computational complexity can be high. In general, when the sample size is n, the order of computational complexity of a distance-based method, which typically requires computing all pairwise distances, can be O(n^2). Recent advances have discovered that in the univariate case, a fast method with O(n log n) computational complexity and O(n) memory requirement exists. In this talk, I introduce a test of independence based on random projection and distance correlation, which achieves nearly the same power as the state-of-the-art distance-based approach, works in the multivariate case, and enjoys O(n K log n) computational complexity and O(max{n,K}) memory requirement, where K is the number of random projections. Note that a saving is achieved when K < n/ log n. We name our method Randomly Projected Distance Covariance (RPDC). The statistical theoretical analysis takes advantage of some techniques on random projection, which are rooted in contemporary machine learning. Numerical experiments demonstrate the efficiency of the proposed method relative to several competitors.
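
The general idea of combining random projections with distance covariance can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the RPDC method itself: it uses the naive O(n^2) univariate sample distance covariance for each projection (rather than the O(n log n) algorithm mentioned above), omits any scaling constants the actual method may use, and all function names are hypothetical.

```python
import math
import random


def centered_distances(v):
    """Double-centered matrix of pairwise absolute differences of a 1-D sample."""
    n = len(v)
    d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d]  # row means (equal to column means by symmetry)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]


def dcov2(u, v):
    """Naive O(n^2) squared sample distance covariance of two 1-D samples."""
    n = len(u)
    a, b = centered_distances(u), centered_distances(v)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2


def random_direction(dim, rng):
    """Uniform random unit vector, obtained by normalizing a Gaussian draw."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def project(rows, direction):
    """Project each multivariate observation onto the given direction."""
    return [sum(r * c for r, c in zip(row, direction)) for row in rows]


def projected_dcov2(x_rows, y_rows, k, seed=0):
    """Average the univariate statistic over k random projection pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        u = project(x_rows, random_direction(len(x_rows[0]), rng))
        v = project(y_rows, random_direction(len(y_rows[0]), rng))
        total += dcov2(u, v)
    return total / k
```

Replacing `dcov2` with an O(n log n) univariate routine is what would give the O(n K log n) cost quoted above; since the full multivariate computation costs O(n^2), the projected version is cheaper exactly when n K log n < n^2, that is, when K < n/ log n.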

Thursday, September 28, 2017 — 4:00 PM EDT
Susan Murphy

Challenges in Developing Learning Algorithms to Personalize Treatment in Real Time

A formidable challenge in designing sequential treatments is to determine when and in which context it is best to deliver treatments. Consider treatment for individuals struggling with chronic health conditions. Operationally designing the sequential treatments involves the construction of decision rules that input current context of an individual and output a recommended treatment. That is, the treatment is adapted to the individual's context; the context may include current health status, current level of social support and current level of adherence, for example. Data sets on individuals with records of time-varying context and treatment delivery can be used to inform the construction of the decision rules. There is much interest in personalizing the decision rules, particularly in real time as the individual experiences sequences of treatment. Here we discuss our work in designing online "bandit" learning algorithms for use in personalizing mobile health interventions.

Tuesday, September 26, 2017 — 4:00 PM EDT

Owning Your Research: Things I wish I knew for graduate study

Pursuing graduate study is a courageous decision for life that takes time, effort, and commitment. Graduate study can be mysterious at the beginning, miserable in the process, and marvelous at the end. As a recent grad, I am going to share some of my graduate study experiences in this presentation. In particular, I hope to give some advice to current graduate students to make their graduate study easier, faster, and happier. Among other recommendations, I will provide some tricks and tips on coding that can improve research productivity. 

Thursday, September 21, 2017 — 4:00 PM EDT

Empirical balancing scores and balancing weights

Propensity scores have been central to causal inference and are often used as balancing scores or balancing weights. Estimated propensity scores, however, may exhibit undesirable finite-sample performance. We take a step back to understand what properties of balancing scores and weights are desirable. For balancing scores, the dimension reduction aspect is important; whereas for balancing weights, a conditional moment balancing property is crucial. Based on these considerations, a joint sufficient dimension reduction framework is proposed for balancing scores, and a covariate functional balancing framework is proposed for balancing weights. 

Monday, September 18, 2017 — 4:00 PM EDT

Multivariate Quantiles: Nonparametric Estimation and Applications to Risk Management

In many applications of hydrology, quantiles provide important insights in the statistical problems considered. In this talk, we focus on the estimation of a notion of multivariate quantiles based on copulas and provide a nonparametric estimation procedure. These quantiles are based on particular level sets of copulas and admit the usual probabilistic interpretation that a p-quantile comprises a probability mass p. We also explore the usefulness of a smoothed bootstrap in the estimation process. Our simulation results show that the nonparametric estimation procedure yields excellent results in finite samples and that the smoothed bootstrap can be beneficially applied.

Thursday, September 14, 2017 — 4:00 PM EDT

Variable selection for case-cohort studies with failure time outcome

Case-cohort designs are widely used in large cohort studies to reduce the cost associated with covariate measurement. In many such studies the number of covariates is very large, so an efficient variable selection method is necessary. We investigate the properties of a variable selection procedure using the smoothly clipped absolute deviation penalty in a case-cohort design with a diverging number of parameters. We establish the consistency and asymptotic normality of the maximum penalized pseudo-partial-likelihood estimator, and show that the proposed variable selection method is consistent and has an asymptotic oracle property. Simulation studies compare the finite-sample performance of the procedure with tuning parameter selection methods based on the Akaike information criterion and the Bayesian information criterion. We make recommendations for use of the proposed procedures in case-cohort studies, and apply them to the Busselton Health Study.

Friday, September 8, 2017 — 10:00 AM EDT

Clustering Heavy-Tailed Stable Data

