Thursday, November 29, 2018 — 4:00 PM EST

**Computational Aspects of Robust Optimized Certainty Equivalent and Option Pricing**

We present a robust extension of the optimized certainty equivalent under distributional uncertainty; this class of risk measures includes the expected shortfall. We show that the infinite-dimensional optimization problem can be reduced to a finite-dimensional one using transport duality methods. Some important cases, such as the expected shortfall, can even be computed explicitly and provide insights into the additional costs arising from distributional uncertainty.

The general result can be further applied to the explicit computation of robust option prices, where we also provide some explicit formulas in the case of call options. Finally, we address the dual representation of the robust optimized certainty equivalent.

This talk is based on joint work with Daniel Bartl and Ludovic Tangpi.
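
As a small illustration of the non-robust starting point, the expected shortfall at level α can be written in optimized-certainty-equivalent form as the minimum over m of m + E[(X − m)⁺]/(1 − α), where X is treated as a loss (a sign convention assumed here, not stated in the abstract). The sketch below evaluates this for an empirical sample and compares it with the average of the worst outcomes. The function names are hypothetical, and the robust version over a set of distributions is not shown.

```python
def oce_expected_shortfall(losses, alpha):
    """Minimise m + mean(max(x - m, 0)) / (1 - alpha) over m.

    For an empirical sample the objective is convex and piecewise linear
    in m with kinks at the sample points, so it is enough to search
    over the sample values themselves.
    """
    n = len(losses)

    def objective(m):
        excess = sum(max(x - m, 0.0) for x in losses)
        return m + excess / (n * (1.0 - alpha))

    return min(objective(m) for m in losses)


def tail_average(losses, alpha):
    """Average of the worst (1 - alpha) fraction of the losses.

    Assumes n * (1 - alpha) is (numerically) an integer.
    """
    k = round(len(losses) * (1.0 - alpha))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


losses = [float(i) for i in range(1, 11)]  # toy losses 1, 2, ..., 10
alpha = 0.8
es_oce = oce_expected_shortfall(losses, alpha)
es_tail = tail_average(losses, alpha)
```

For this toy sample both quantities evaluate to 9.5 up to floating-point rounding, the average of the two largest losses.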

Thursday, November 22, 2018 — 4:00 PM EST

**A Bayesian Approach to Joint Modeling of Matrix-valued Imaging Data and Treatment Outcome with Applications to Depression Studies**

In this talk, we discuss a unified Bayesian joint modeling framework for studying the association between a binary treatment outcome and a baseline matrix-valued predictor. Specifically, a joint modeling approach relating an outcome to a matrix-valued predictor through a probabilistic formulation of multilinear principal component analysis (MPCA) is developed. This framework establishes a theoretical relationship between the outcome and the matrix-valued predictor, although the predictor is not explicitly expressed in the model. Simulation studies show that the proposed method is superior to or competitive with other methods, such as a two-stage approach and classical principal component regression (PCR), in terms of both prediction accuracy and estimation of association; its advantage is most notable when the sample size is small and the dimensionality of the imaging covariate is large. Finally, our proposed joint modeling approach is shown to be a very promising tool in an application exploring the association between baseline EEG data and a favorable response to treatment in a depression treatment study, achieving a substantial improvement in prediction accuracy over competing methods.

Wednesday, November 21, 2018 — 4:00 PM EST

**Eigen Portfolio Selection: A Robust Approach to Sharpe Ratio Maximization**

We show that even when a covariance matrix is poorly estimated, it is still possible to obtain a robust maximum Sharpe ratio portfolio by exploiting the uneven distribution of estimation errors across principal components. This is accomplished by approximating an investor’s view on future asset returns using a few relatively accurate sample principal components. We discuss two approximation methods. The first method leads to a subtle connection to existing approaches in the literature, while the second is novel and able to address the main shortcomings of existing methods.

*** Pizza & refreshments will be provided ***

*Everyone welcome!*

Friday, November 16, 2018 — 11:00 AM EST

**The Ising model: series expansions and new algorithms**

We propose new and simple Monte Carlo methods to estimate the partition function of the Ising model. The methods are based on the well-known series expansion of the partition function from statistical physics. For the Ising model, typical Monte Carlo methods work well at high temperature, but fail in the low-temperature regime. We demonstrate that our proposed Monte Carlo methods work differently: they behave particularly well at low temperature. We also compare the accuracy of our estimators with the state-of-the-art variational methods.
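
The abstract does not say which series expansion is used. One well-known candidate is the high-temperature expansion for the zero-field model, Z = 2^N cosh(K)^{|E|} Σ tanh(K)^{|S|}, where the sum runs over edge subsets S in which every vertex has even degree and K = βJ. The sketch below checks this identity by brute force on a tiny graph. It is an illustration of the expansion only, not the speakers' Monte Carlo estimator, and the graph and function names are made up for the example.

```python
from itertools import combinations, product
from math import cosh, exp, tanh


def z_brute_force(n, edges, K):
    """Sum exp(K * sum_{(i,j)} s_i s_j) over all 2^n spin configurations."""
    total = 0.0
    for spins in product((-1, 1), repeat=n):
        total += exp(K * sum(spins[i] * spins[j] for i, j in edges))
    return total


def z_series(n, edges, K):
    """High-temperature expansion: sum over even-degree edge subsets."""
    t = tanh(K)
    subgraph_sum = 0.0
    for r in range(len(edges) + 1):
        for subset in combinations(edges, r):
            degree = [0] * n
            for i, j in subset:
                degree[i] += 1
                degree[j] += 1
            # Only subsets where every vertex has even degree contribute.
            if all(d % 2 == 0 for d in degree):
                subgraph_sum += t ** r
    return 2 ** n * cosh(K) ** len(edges) * subgraph_sum


# Toy graph: a 4-cycle with one diagonal (hypothetical example).
n_sites = 4
edges = [(0, 1), (1, 3), (3, 2), (2, 0), (0, 3)]
K = 0.7
z_exact = z_brute_force(n_sites, edges, K)
z_expanded = z_series(n_sites, edges, K)
```

Both enumerations are exponential in the system size, which is why sampling-based estimators are needed for lattices of realistic size.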

Thursday, November 8, 2018 — 4:00 PM EST

**Ghost Data**

As natural as the real data, ghost data is everywhere—it is just data that you cannot see. We need to learn how to handle it, how to model with it, and how to put it to work. Some examples of ghost data are (see Sall, 2017):

(a) Virtual data—it isn’t there until you look at it;

(b) Missing data—there is a slot to hold a value, but the slot is empty;

(c) Pretend data—data that is made up;

(d) Highly Sparse Data—whose absence implies a near zero; and

(e) Simulation data—data to answer “what if.”

For example, absence of evidence/data is not evidence of absence. In fact, it can be evidence of something. Moreover, ghost data can be extended to other existing areas: Hidden Markov Chain, Two-stage Least Squares Estimate, Optimization via Simulation, Partition Model, Topological Data, just to name a few.

Three movies will be discussed in this talk: (1) “The Sixth Sense” (Bruce Willis)—I can see things that you cannot see; (2) “Sherlock Holmes” (Robert Downey Jr.)—absence of expected facts; and (3) “Edge of Tomorrow” (Tom Cruise)—how to speed up your learning (AlphaGo-Zero will also be discussed). It will be helpful if you watch these movies before coming to my talk. This is an early stage of my research in this area, and any feedback from you is deeply appreciated. Much of the basic idea is highly influenced by Mr. John Sall (JMP-SAS).

Thursday, November 1, 2018 — 4:00 PM EDT

**Copula Gaussian graphical models for functional data**

We consider the problem of constructing statistical graphical models for functional data; that is, the observations on the vertices are random functions. These types of data are common in medical applications such as EEG and fMRI. Recently published functional graphical models rely on the assumption that the random functions are Hilbert-space-valued Gaussian random elements. We relax this assumption by introducing copula Gaussian random elements in Hilbert spaces, leading to what we call the Functional Copula Gaussian Graphical Model (FCGGM). This model removes the marginal Gaussian assumption but retains the simplicity of the Gaussian dependence structure, which is particularly attractive for large data. We develop four estimators, together with their implementation algorithms, for the FCGGM. We establish the consistency and the convergence rates of one of the estimators under different sets of sufficient conditions with varying strengths. We compare our FCGGM with the existing functional Gaussian graphical model by simulation, under both non-Gaussian and Gaussian graphical models, and apply our method to an EEG data set to construct brain networks.

Tuesday, October 30, 2018 — 4:00 PM EDT

**Systemic risk and the optimal capital requirements in a model of financial networks and fire sales**

I consider an interbank network with fire sales externalities and multiple illiquid assets and study the problem of optimally trading off between capital reserves and systemic risk. I find that the problems of measuring systemic risk and determining the optimal capital requirements under various liquidation rules can be formulated as convex and convex mixed-integer programming problems. To solve the convex MIP, I offer an iterative algorithm that converges to the optimal solutions. I show the results of the methodology through numerical examples and provide implications for regulatory policies and related research topics.

Thursday, October 25, 2018 — 4:00 PM EDT

**Causal Inference with Unmeasured Confounding: an Instrumental Variable Approach**

Causal inference is a challenging problem because causation cannot be established from observational data alone. Researchers typically rely on additional sources of information to infer causation from association. Such information may come from powerful designs such as randomization, or background knowledge such as information on all confounders. However, perfect designs or background knowledge required for establishing causality may not always be available in practice. In this talk, I use novel causal identification results to show that the instrumental variable approach can be used to combine the power of design and background knowledge to draw causal conclusions. I also introduce novel estimation tools to construct estimators that are robust, efficient and enjoy good finite sample properties. These methods will be discussed in the context of a randomized encouragement design for a flu vaccine.

Friday, October 19, 2018 — 11:00 AM EDT

**Efficient Estimation, Robust Testing and Design Optimality for Two-Phase Studies**

Two-phase designs are cost-effective sampling strategies when some covariates are too expensive to measure on all study subjects. Well-known examples include case-control, case-cohort, nested case-control and extreme tail sampling designs. In this talk, I will discuss three important aspects of two-phase studies: estimation, hypothesis testing and design optimality. First, I will discuss efficient estimation methods we have developed for two-phase studies. We allow expensive covariates to be correlated with inexpensive covariates collected in the first phase. Our proposed estimation is based on maximization of a modified nonparametric likelihood function through a generalization of the expectation-maximization algorithm. The resulting estimators are shown to be consistent, asymptotically normal and asymptotically efficient with easily estimated variances. Second, I will focus on hypothesis testing in two-phase studies. We propose a robust test procedure based on imputation. The proposed procedure guarantees preservation of type I error, allows high-dimensional inexpensive covariates, and yields higher power than alternative imputation approaches. Finally, I will present some recent developments on design optimality. We show that for general outcomes, the most efficient design is an extreme-tail sampling design based on certain residuals. This conclusion also explains the high efficiency of extreme tail sampling for continuous outcomes and balanced case-control design for binary outcomes. Throughout the talk, I will present numerical evidence from simulation studies and illustrate our methods using different applications.

Wednesday, October 17, 2018 — 4:00 PM EDT

**Uncovering the Mechanisms of General Anesthesia: Where Neuroscience Meets Statistics**

General anesthesia is a drug-induced, reversible condition involving unconsciousness, amnesia (loss of memory), analgesia (loss of pain sensation), akinesia (immobility), and hemodynamic stability. I will describe a primary mechanism through which anesthetics create these altered states of arousal. Our studies have allowed us to give a detailed characterization of the neurophysiology of loss and recovery of consciousness, in the case of propofol, and we have demonstrated that the state of general anesthesia can be rapidly reversed by activating specific brain circuits. The success of our research has depended critically on tight coupling of experiments, statistical signal processing and mathematical modeling.

Thursday, October 11, 2018 — 4:00 PM EDT

**Probabilistic approaches to mine association rules**

Mining association rules is an important and widely applied data mining technique for discovering patterns in large datasets. However, the commonly used support-confidence framework has some often overlooked weaknesses. This talk introduces a simple stochastic model and shows how it can be used in association rule mining. We apply the model to simulate data for analyzing the behavior and shortcomings of confidence and other measures of interestingness (e.g., lift). Based on these findings, we develop a new model-driven approach to mine rules based on the notion of NB-frequent itemsets, and we define a measure of interestingness which controls for spurious rules and has a strong foundation in statistical testing theory.
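
To make the support-confidence framework concrete, the sketch below computes the standard definitions of support, confidence and lift on a made-up transaction list. It illustrates only the classical measures whose weaknesses the talk examines; the NB-frequent itemset approach is not shown, and the data and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate of P(rhs | lhs) for the rule lhs -> rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Confidence relative to the baseline frequency of rhs."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Toy market-basket data (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

conf_milk_bread = confidence(transactions, {"milk"}, {"bread"})
lift_milk_bread = lift(transactions, {"milk"}, {"bread"})
```

Here the rule {milk} → {bread} has confidence of roughly 0.75, yet its lift is below 1 because bread is already present in 80% of the transactions. This is one simple way in which confidence alone can overstate how interesting a rule is.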

Thursday, October 4, 2018 — 4:00 PM EDT

**Methods for High Dimensional Compositional Data Analysis in Microbiome Studies**

Human microbiome studies using high throughput DNA sequencing generate compositional data, with the absolute abundances of microbes not recoverable from sequence data alone. In compositional data analysis, each sample consists of proportions of various organisms with a unit sum constraint. This simple feature can lead traditional statistical methods, when naively applied, to produce errant results and spurious associations. In addition, microbiome sequence data sets are typically high dimensional, with the number of taxa much greater than the number of samples. These important features require further development of methods for analysis of high dimensional compositional data. This talk presents several recent developments in this area, including methods for estimating the compositions based on sparse count data, two-sample tests for compositional vectors and regression analysis with compositional covariates. Several microbiome studies at the University of Pennsylvania are used to illustrate these methods, and several open questions will be discussed.
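
The effect of the unit sum constraint can be seen in a toy calculation. In the sketch below, with made-up abundances and a hypothetical `closure` helper, two samples with very different absolute abundances map to the same composition, and increasing a single taxon changes the proportions of all the others even though their absolute abundances are unchanged.

```python
def closure(abundances):
    """Rescale non-negative abundances to proportions that sum to one."""
    total = sum(abundances)
    return [a / total for a in abundances]


# Same composition, tenfold difference in absolute abundance.
sample_a = closure([10, 30, 60])
sample_b = closure([100, 300, 600])

# Only the third taxon grows in absolute terms...
before = closure([10, 30, 60])
after = closure([10, 30, 160])
# ...yet the proportions of the first two taxa fall.
first_two_decreased = after[0] < before[0] and after[1] < before[1]
```

This induced dependence among proportions is one reason why correlations computed naively on compositional data can be misleading.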

Saturday, September 29, 2018 — 8:00 AM to 6:00 PM EDT

We are extending an invitation to a **select group of talented undergraduate, graduate and PhD students** to participate in the upcoming **University of Waterloo Datathon**.

Thursday, September 27, 2018 — 4:00 PM EDT

**Agent-based Asset Pricing, Learning, and Chaos**

The Lucas asset pricing model is one of the most studied models in financial economics in the past decade. In our research, we relax the original assumptions of homogeneous agents and rational expectations in the Lucas model. We populate an artificial economy with heterogeneous and boundedly rational agents. By defining a Correct Expectations Equilibrium, agents are able to compute their policy functions and the equilibrium pricing function without perfect information about the market. A natural adaptive learning scheme is given to agents to update their predictions. We examine the convergence of equilibrium with this learning scheme and show that the equilibrium is learnable (convergent) under certain parameter combinations. We also investigate the market dynamics when agents are out of equilibrium, including the cases where prices have excess volatility and the trading volume is high. Numerical simulations show that our system exhibits rich dynamics, including a whole cascade from period doubling bifurcations to chaos.

Thursday, September 20, 2018 — 4:00 PM EDT

**Bayesian Approaches to Dynamic Model Selection**

In many applications, investigators monitor processes that vary in space and time, with the goal of identifying temporally persistent and spatially localized departures from a baseline or "normal" behavior. In this talk, I will first discuss a principled Bayesian approach for estimating time varying functional connectivity networks from brain fMRI data. Dynamic functional connectivity, i.e., the study of how interactions among brain regions change dynamically over the course of an fMRI experiment, has recently received wide interest in the neuroimaging literature. Our method utilizes a hidden Markov model for classification of latent neurological states, achieving estimation of the connectivity networks in an integrated framework that borrows strength over the entire time course of the experiment. Furthermore, we assume that the graph structures, which define the connectivity states at each time point, are related within a super-graph, to encourage the selection of the same edges among related graphs. Then, I will propose a Bayesian nonparametric model selection approach with an application to the monitoring of pneumonia and influenza (P&I) mortality, to detect influenza outbreaks in the continental United States. More specifically, we introduce a zero-inflated conditionally identically distributed species sampling prior, which allows borrowing information across time and assigning data to clusters associated with either a null or an alternate process. Spatial dependences are accounted for by means of a Markov random field prior, which allows the selection to be informed by inferences conducted at nearby locations. We show how the proposed modeling framework performs in an application to the P&I mortality data and in a simulation study, and compare it with common threshold methods for detecting outbreaks over time, with more recent Markov switching based models, and with other Bayesian nonparametric priors that do not take into account spatio-temporal dependence.

Thursday, September 13, 2018 — 4:00 PM EDT

**The Critical Role of Statistics in Evaluating Forensic Evidence**

Statisticians have been important contributors to many areas of science, including chemistry (chemometrics), biology (genomics), medicine (clinical trials), and agriculture (crop yield), leading to valuable advances in statistical research that have benefitted multiple fields (e.g., spectral analysis, penalized regression, sequential analysis, experimental design). Yet the involvement of statistics specifically in forensic science has not been nearly as extensive, especially in view of its importance (ensuring proper administration of justice) and the value it has demonstrated thus far (e.g., forensic DNA, assessment of bullet lead evidence, significance of findings in the U.S. anthrax investigation, reliability of eyewitness identification). Forensic methods in many areas remain unvalidated, as recent investigations have highlighted (notably, bite marks and hair analysis). In this talk, I will provide three examples (including one with data kindly provided by Canadian Forensic Service) where statistics plays a vital role in evaluating forensic evidence and motivate statistical research that can have both theoretical and practical value, and ultimately strengthen forensic evidence.

Thursday, August 9, 2018 — 4:00 PM EDT

**An Adaptive-to-Model Test for Parametric Single-Index Errors-in-Variables Models**

This seminar discusses tests for fitting a parametric single-index regression model when covariates are measured with error and validation data is available. We propose two tests whose consistency rates do not depend on the dimension of the covariate vector when an adaptive-to-model strategy is applied. One of these tests has a bias term that becomes arbitrarily large with increasing sample size but a smaller asymptotic variance, while the other is asymptotically unbiased with a larger asymptotic variance. Compared with the existing local smoothing tests, the new tests behave like a classical local smoothing test with only one covariate, and are still omnibus against general alternatives.

This avoids the difficulty associated with the curse of dimensionality.

Further, a systematic study is conducted to give insight into how the ratio between the sample size and the size of the validation data affects the asymptotic behavior of these tests. Simulations are conducted to examine the performance in several finite sample scenarios.

Wednesday, August 8, 2018 — 4:00 PM EDT

**Model Confidence Bounds for Variable Selection**

In this article, we introduce the concept of model confidence bounds (MCBs) for variable selection in the context of nested models. Similarly to the endpoints in the familiar confidence interval for parameter estimation, the MCBs identify two nested models (upper and lower confidence bound models) containing the true model at a given level of confidence. Instead of trusting a single selected model obtained from a given model selection method, the MCBs propose a group of nested models as candidates, and the MCBs’ width and composition enable the practitioner to assess the overall model selection uncertainty. A new graphical tool, the model uncertainty curve (MUC), is introduced to visualize the variability of model selection and to compare different model selection procedures. The MCBs methodology is implemented by a fast bootstrap algorithm that is shown to yield the correct asymptotic coverage under rather general conditions. Our Monte Carlo simulations and a real data example confirm the validity and illustrate the advantages of the proposed method.

Thursday, August 2, 2018 — 4:00 PM EDT

**No Such Thing as Missing Data**

The phrase "missing data" has come to mean "information we really, really wish we had". But is it actually data, and is it actually missing? I will discuss the practical implications of taking a different philosophical perspective, and demonstrate the use of a simple model for informative observation in longitudinal studies that does not require any notion of missing data.

Thursday, July 19, 2018 — 4:00 PM EDT

**Extracting Latent States from High Frequency Option Prices**

We propose the realized option variance as a new observable variable to integrate high frequency option prices in the inference of option pricing models. Using simulation and empirical studies, this paper documents the incremental information offered by this realized measure. Our empirical results show that the information contained in the realized option variance improves the inference of model variables such as the instantaneous variance and variance jumps of the S&P 500 index. Parameter estimates indicate that the risk premium breakdown between jump and diffusive risks is affected by the omission of this information.

Tuesday, July 3, 2018 — 4:00 PM EDT

**Dirichlet Process and Poisson-Dirichlet Distribution**

Dirichlet process and Poisson-Dirichlet distribution are closely related random measures that arise in a wide range of subjects. The talk will focus on their constructions and asymptotic behaviour in different regimes including the law of large numbers, the fluctuation theorems, and large deviations.
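
One standard construction linking the two objects is stick-breaking: break off Beta(1, θ) fractions of a unit stick, attach the resulting weights to independent draws from a base measure to obtain a (truncated) Dirichlet process sample, and rank the weights in decreasing order to approximate a Poisson-Dirichlet(θ) sample. The sketch below is a truncated illustration under these assumptions. The abstract does not say which constructions the talk covers, and the standard normal base measure, the truncation level and the function name are choices made here for the example.

```python
import random


def stick_breaking_weights(theta, n_atoms, rng):
    """Return the first n_atoms stick-breaking weights and the unbroken remainder."""
    weights = []
    remaining = 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, theta)  # fraction of the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights, leftover = stick_breaking_weights(theta=2.0, n_atoms=50, rng=rng)

# Truncated Dirichlet process draw: weights on atoms from an assumed N(0, 1) base measure.
atoms = [rng.gauss(0.0, 1.0) for _ in weights]

# Ranked weights: a truncated approximation to a Poisson-Dirichlet(theta) sample.
ranked = sorted(weights, reverse=True)
```

The leftover mass shrinks geometrically on average as more atoms are added, so a moderate truncation level already captures almost all of the weight.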

Monday, June 25, 2018 — 4:00 PM EDT

**Assessing financial model risk**

Model risk has a huge impact on any financial or insurance risk measurement procedure and its quantification is therefore a crucial step. In this talk, we introduce three quantitative measures of model risk when choosing a particular reference model within a given class: the absolute measure of model risk, the relative measure of model risk and the local measure of model risk. Each of the measures has a specific purpose and so allows for flexibility. We illustrate the various notions by studying some relevant examples, so as to emphasize the practicability and tractability of our approach.

Thursday, May 24, 2018 — 4:00 PM EDT

**Does your phylogenetic tree fit your data?**

Phylogenetic methods are used to infer ancestral relationships based on genetic and morphological data. What started as more sophisticated clustering has now become a more and more complex machinery of estimating ancestral processes and divergence times. One major branch of inference is maximum likelihood methods. Here, one selects the parameters from a given model class for which the data are more likely to occur than for any other set of parameters of the same model class. Most analysis of real data is executed using such methods.

However, one step of statistical inference that has little exposure to application is the goodness of fit test between inferred model and data. There seem to be various reasons for this behaviour: users are either content with using a bootstrap approach to obtain support for the inferred topology, are afraid that a goodness of fit test would find little or no support for their phylogeny, thus demeaning their carefully assembled data, or simply lack the statistical background to acknowledge this step.

Recently, methods to detect sections of the data which do not support the inferred model have been proposed, and strategies to explain these differences have been devised. In this talk I will present and discuss some of these methods, their shortcomings and possible ways of improving them.

Thursday, May 17, 2018 — 4:00 PM EDT

**Combining phenotypes, genotypes and genealogies to find trait-influencing variants**

A basic tenet of statistical genetics is that shared ancestry leads to trait similarities in individuals. Related individuals share segments of their genome, derived from a common ancestor. The coalescent is a popular mathematical model of the shared ancestry that represents the relationships amongst segments as a set of binary trees, or genealogies, along the genome. While these genealogies cannot be observed directly, the genetic-marker data enable us to sample from their posterior distribution. We may compare the clustering of trait values on each genealogical tree that is sampled to the clustering under the coalescent prior distribution. This comparison provides a latent p-value that reflects the degree of surprise about the trait clustering in the sampled tree. The distribution of these latent p-values is the fuzzy p-value as defined by Geyer and Thompson. The fuzzy p-value contrasts the posterior and prior distributions of trait clustering on the latent genealogies and is informative for mapping trait-influencing variants. In this talk, I will discuss these ideas with application to data from an immune-marker study, present results from preliminary analyses and highlight potential avenues for further research.

Saturday, May 12, 2018 — 8:00 AM to 6:00 PM EDT