Thursday, December 19, 2019 — 10:00 AM EST

#### Uncover Hidden Fine-Grained Scientific Information: Structured Latent Attribute Models

In modern psychological and biomedical research with diagnostic purposes, scientists often formulate the key task as inferring the fine-grained latent information under structural constraints. These structural constraints usually come from the domain experts’ prior knowledge or insight. The emerging family of Structured Latent Attribute Models (SLAMs) accommodates these modeling needs and has received substantial attention in psychology, education, and epidemiology. SLAMs bring exciting opportunities and unique challenges. In particular, with high-dimensional discrete latent attributes and structural constraints encoded by a design matrix, one needs to balance the gain in the model’s explanatory power and interpretability against the difficulty of understanding and handling the complex model structure.

In the first part of this talk, I present identifiability results that advance the theoretical knowledge of how the design matrix influences the estimability of SLAMs. The new identifiability conditions guide real-world practices of designing diagnostic tests and also lay the foundation for drawing valid statistical conclusions. In the second part, I introduce a statistically consistent penalized likelihood approach to selecting significant latent patterns in the population. I also propose a scalable computational method. These developments explore an exponentially large model space involving many discrete latent variables, and they address the estimation and computation challenges of high-dimensional SLAMs arising from large-scale scientific measurements. The application of the proposed methodology to the data from an international educational assessment reveals meaningful knowledge structure of the student population.

Monday, December 16, 2019 — 10:00 AM EST

#### The Blessings of Multiple Causes

Causal inference from observational data is a vital problem, but it comes with strong assumptions. Most methods assume that we observe all confounders, variables that affect both the causal variables and the outcome variables. But whether we have observed all confounders is a famously untestable assumption. We describe the deconfounder, a way to do causal inference from observational data allowing for unobserved confounding.

How does the deconfounder work? The deconfounder is designed for problems of multiple causal inferences: scientific studies that involve many causes whose effects are simultaneously of interest. The deconfounder uses the correlation among causes as evidence for unobserved confounders, combining unsupervised machine learning and predictive model checking to perform causal inference. We study the theoretical requirements for the deconfounder to provide unbiased causal estimates, along with its limitations and tradeoffs. We demonstrate the deconfounder on real-world data and simulation studies.

Thursday, December 5, 2019 — 4:00 PM EST

#### Having impact with data science in industry … Data science as an integrated set of skills in data analytics, data engineering and data entrepreneurship

Industry is going through rapid and profound changes, and the possibilities created by data science are one of the phenomena driving them. But data science is more than analytics and machine learning, and students need a T-shaped package of skills to have a successful career.

This talk sketches the role of data science in a rapidly changing industry. We discuss applications of data science in different innovation horizons, from improvement of current processes and product lines, to new business models and disruptive innovation driven by data. Having impact with data science in large, complex organizations is a challenge. It requires a blend of skills in analytics and machine learning, knowledge of computer science and IT infrastructures, and expertise in entrepreneurial project management. The talk presents a framework for organizing data science in CRISP-DM projects, and the various roles of data analysts, data engineers, domain experts, and executives. I also present the teaching philosophy of the Jheronimus Academy of Data Science in The Netherlands, where we design teaching programs around the three pillars of data analytics, engineering and entrepreneurship, and where programs are delivered in close collaboration with six application domains. Finally, I share some personal observations on the role of statistical thinking in the computer-science dominated world of data science.

Friday, November 29, 2019 — 10:30 AM EST

#### Noncausal Affine Processes with Applications to Derivative Pricing

Linear factor models, where the factors are affine processes, play a key role in finance, since they allow for quasi-closed form expressions of the term structure of risks. We introduce the class of noncausal affine linear factor models by considering factors that are affine in reverse time. These models are especially relevant for pricing sequences of speculative bubbles. We show that they feature much more complicated non-affine dynamics in calendar time, while still providing (quasi) closed form term structures and derivative pricing formulas. The framework is illustrated with zero-coupon bond and European call option pricing examples.

Thursday, November 28, 2019 — 4:00 PM EST

#### Bayesian inference of dynamic systems via constrained Gaussian processes

Ordinary differential equations are an important tool for modeling behaviors in science, such as gene regulation, epidemics, etc. An important statistical problem is to infer and characterize the uncertainty of parameters that govern the equations. We present a fast Bayesian inference method using constrained Gaussian processes, such that the derivatives of the Gaussian process must satisfy the dynamics of the differential equations. Our method completely avoids the numerical solver and is thus practically fast to compute. Our construction is cleanly embedded in a rigorous Bayesian framework, and is demonstrated to yield fast and reliable inference in a variety of practical scenarios.

Friday, November 22, 2019 — 10:30 AM EST

#### Do Jumps Matter in the Long Run? A Tale of Two Horizons

Economic scenario generators (ESGs) for equities are important components of the valuation and risk management process of life insurance and pension plans. As the resulting liabilities are very long-lived, it is a desired feature of an ESG to replicate equity returns over such horizons. However, the short-term performance of the assets backing these liabilities may also trigger significant losses and, in turn, affect the financial stability of the insurer or plan. For example, a line of GLWBs with frequent withdrawals may trigger losses when subaccounts suddenly lose value after a stock market crash, or pension contributions may need to be revised after a long-lasting economic slump. Therefore, the ESG must replicate both short- and long-term stock price dynamics in a consistent manner, which is a critical problem in actuarial finance. Popular features of financial models include stochastic volatility and jumps, and as such, we would like to investigate how these features matter for typical long-term actuarial applications.

For a model to be useful in actuarial finance, it should at least replicate the dynamics of daily, monthly and annual returns (and any frequency in between). A crucial characteristic of returns at these scales is that the kurtosis tends to be very high on a daily basis (25-30) but close to 4-5 on an annual basis. We show that jump-diffusion models, featuring both stochastic volatility and jumps, cannot replicate such features if estimated by maximum likelihood. Using the generalized method of moments (GMM), we find that simple jump-diffusion models or regime-switching models (with at least three regimes) have an excellent fit for various moments observed at different time scales. Finally, we investigate three typical actuarial applications: $1 accumulated in the long run with no intermediate monitoring, a long-term solvency analysis with frequent monitoring, and a portfolio rebalancing problem, also with frequent monitoring and updates. Overall, we find that a stochastic volatility model with independent jumps or a regime-switching lognormal model with three regimes, both fitted with GMM, yield the best fit to moments at different scales and also provide the most conservative figures in actuarial applications, especially when there is intermediate monitoring.

So yes, jumps or jump-like features are essential in the long run. This also illustrates how typical actuarial models fitted by maximum likelihood may be inadequate for reserving, economic capital and solvency analyses.

Thursday, November 21, 2019 — 4:00 PM EST

#### A General Framework for Quantile Estimation with Incomplete Data

Quantile estimation has attracted significant research interest in recent years. However, the literature on quantile estimation in the presence of incomplete data remains limited. In this paper, we propose a general framework to address this problem. Our framework combines the two widely adopted approaches for missing data analysis, the imputation approach and the inverse probability weighting approach, via the empirical likelihood method. The proposed method is capable of dealing with many different missingness settings. We mainly study three of them: (i) estimating the marginal quantile of a response that is subject to missingness while there are fully observed covariates; (ii) estimating the conditional quantile of a fully observed response while the covariates are partially available; and (iii) estimating the conditional quantile of a response that is subject to missingness with fully observed covariates and extra auxiliary variables. The proposed method allows multiple models for both the missingness probability and the data distribution. The resulting estimators are multiply robust in the sense that they are consistent if any one of these models is correctly specified. The asymptotic distributions are established using empirical process theory.

Joint work with Peisong Han, Jiwei Zhao and Xingcai Zhou.
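
As a rough illustration of one ingredient named in the abstract above, the sketch below computes an inverse-probability-weighted (IPW) marginal quantile. It is not the speakers' empirical-likelihood estimator: the function name `ipw_quantile`, the toy data, and the assumption that the observation probabilities are known are all hypothetical choices made here.

```python
def ipw_quantile(y, observed, prob_observed, tau):
    """Smallest observed response whose IPW-weighted empirical CDF reaches tau.

    y             : responses (entries for missing cases are ignored)
    observed      : 1 if the response was observed, 0 if missing
    prob_observed : assumed-known probability that each response is observed
    tau           : quantile level in (0, 1)
    """
    # Keep observed cases only, each weighted by 1 / P(observed).
    pairs = sorted(
        (yi, 1.0 / pi)
        for yi, r, pi in zip(y, observed, prob_observed)
        if r
    )
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for yi, w in pairs:
        cumulative += w
        if cumulative >= tau * total:
            return yi
    return pairs[-1][0]


# Toy data: larger responses are less likely to be observed.
y = [1.0, 2.0, None, 4.0, 5.0]
observed = [1, 1, 0, 1, 1]
prob_observed = [0.9, 0.9, 0.5, 0.5, 0.9]

weighted_median = ipw_quantile(y, observed, prob_observed, tau=0.5)
# Same rule with equal weights, i.e. ignoring the missingness mechanism.
naive_median = ipw_quantile(y, observed, [1.0] * len(y), tau=0.5)
```

On this toy data the weighted median is 4.0 while the unweighted one is 2.0, because the under-observed response 4.0 receives a larger weight.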

Friday, November 15, 2019 — 10:30 AM EST

**Insurance Pricing in a Competitive Market**

Insurance is usually defined as "the contribution of the many to the misfortune of the few". This idea of pooling risks together using the law of large numbers legitimizes the use of the expected value as the actuarial "fair" premium. In the context of heterogeneous risks, nevertheless, it is possible to justify price segmentation based on observable characteristics. But in the context of "Big Data", intensive segmentation can be observed, with a much wider range of premiums offered on a given portfolio. In this talk, we will briefly revisit economic, actuarial and philosophical approaches to insurance pricing, trying to link a fair unique premium on a given population and a highly segmented one. We will then turn to recent experiments (the so-called "actuarial pricing game") organized since 2015, in which (real) actuaries played in a competitive (artificial) market that mimics a real insurance market. We will discuss conclusions obtained from two editions: the first one and the most recent one, in which a dynamic version of the game was launched.

Thursday, November 14, 2019 — 4:00 PM EST

#### On Khintchine's Inequality for Statistics

In complex estimation and hypothesis testing settings, it may be impossible to compute p-values or construct confidence intervals using classical analytic approaches like asymptotic normality. Instead, one often relies on randomization and resampling procedures such as the bootstrap or permutation test. But these approaches suffer from the computational burden of large scale Monte Carlo runs. To remove this burden, we develop analytic methods for hypothesis testing and confidence intervals by specifically considering the discrete finite sample distributions of the randomized test statistic. The primary tool we use to achieve such results is Khintchine's inequality and its extensions and generalizations.
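
For reference, the classical form of Khintchine's inequality can be stated as follows; the notation ($\varepsilon_i$, $a_i$, $A_p$, $B_p$) is chosen here for illustration and is not taken from the talk. For independent Rademacher signs $\varepsilon_1,\dots,\varepsilon_n$ (each equal to $\pm 1$ with probability $1/2$), real coefficients $a_1,\dots,a_n$, and any $0<p<\infty$, there are constants $A_p, B_p > 0$ depending only on $p$ such that

$$
A_p \Big(\sum_{i=1}^{n} a_i^2\Big)^{1/2} \;\le\; \Big(\mathbb{E}\,\Big|\sum_{i=1}^{n} \varepsilon_i a_i\Big|^{p}\Big)^{1/p} \;\le\; B_p \Big(\sum_{i=1}^{n} a_i^2\Big)^{1/2}.
$$

A sign-flipping randomization statistic has the form $\sum_i \varepsilon_i a_i$ once the data are held fixed, so bounds of this type control its moments analytically, which suggests how they could stand in for Monte Carlo resampling.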

Friday, November 8, 2019 — 10:30 AM EST

**Transformed norm risk measures on their natural domain**

P. Cheridito and T. Li (2009) introduced the class of transformed norm risk measures. This is a fairly large class of real-valued convex law-invariant monetary risk measures, which includes the expected shortfall, the Haezendonck-Goovaerts risk measure, the entropic risk measure and other important examples. The natural domain of a transformed norm risk measure T is an appropriate Orlicz space. Nonetheless, dual representations for this class of risk measures are only known if T is restricted to the Orlicz heart. In this talk we will explore dual representations on their natural domain. Moreover, we will discuss continuity properties of dilatation monotone risk measures on general model spaces that are of independent interest.

Thursday, November 7, 2019 — 4:00 PM EST

#### Nonregular and Minimax Estimation of Individualized Thresholds in High Dimension with Binary Responses

Given a large number of covariates $\mathbf{Z}$, we consider the estimation of a high-dimensional parameter $\boldsymbol{\theta}$ in an individualized linear threshold $\boldsymbol{\theta}^T\mathbf{Z}$ for a continuous variable $X$, which minimizes the disagreement between $\operatorname{sign}(X-\boldsymbol{\theta}^T\mathbf{Z})$ and a binary response $Y$. While the problem can be formulated in the M-estimation framework, minimizing the corresponding empirical risk function is computationally intractable due to the discontinuity of the sign function. Moreover, estimating $\boldsymbol{\theta}$ even in the fixed-dimensional setting is known to be a nonregular problem leading to nonstandard asymptotic theory. To tackle the computational and theoretical challenges in the estimation of the high-dimensional parameter $\boldsymbol{\theta}$, we propose an empirical risk minimization approach based on a regularized smoothed non-convex loss function. The Fisher consistency of the proposed method is guaranteed as the bandwidth of the smoothed loss is shrunk to 0. Statistically, we show that the finite sample error bound for estimating $\boldsymbol{\theta}$ in $\ell_2$ norm is $(s\log d/n)^{\beta/(2\beta+1)}$, where $d$ is the dimension of $\boldsymbol{\theta}$, $s$ is the sparsity level, $n$ is the sample size and $\beta$ is the smoothness of the conditional density of $X$ given the response $Y$ and the covariates $\mathbf{Z}$. The convergence rate is nonstandard and slower than that in the classical Lasso problems. Furthermore, we prove that the resulting estimator is minimax rate optimal up to a logarithmic factor. Lepski's method is developed to achieve adaptation to the unknown sparsity $s$ and smoothness $\beta$. Computationally, an efficient path-following algorithm is proposed to compute the solution path. We show that this algorithm achieves a geometric rate of convergence for computing the whole path. Finally, we evaluate the finite sample performance of the proposed estimator in simulation studies and a real data analysis from the ChAMP (Chondral Lesions And Meniscus Procedures) Trial.

Thursday, October 31, 2019 — 4:00 PM EDT

#### Variable selection for structured high-dimensional data using known and novel graph information

Variable selection for structured high-dimensional covariates lying on an underlying graph has drawn considerable interest. However, most of the existing methods may not be scalable to high-dimensional settings involving tens of thousands of variables lying on known pathways, as is the case in genomics studies, and they assume that the graph information is fully known. This talk will focus on addressing these two challenges. In the first part, I will present an adaptive Bayesian shrinkage approach which incorporates known graph information through shrinkage parameters and is scalable to high-dimensional settings (e.g., p~100,000 or millions). We also establish theoretical properties of the proposed approach for fixed and diverging p. In the second part, I will tackle the issue that graph information is not fully known. For example, the role of miRNAs in regulating gene expression is not well understood and the miRNA regulatory network is often not validated. We propose an approach that treats unknown graph information as missing data (i.e., missing edges), introduce the idea of imputing the unknown graph information, and define the imputed information as the novel graph information. In addition, we propose a hierarchical group penalty to encourage sparsity at both the pathway level and the within-pathway level, which, combined with the imputation step, allows for incorporation of known and novel graph information. The methods are assessed via simulation studies and are applied to analyses of cancer data.

Friday, October 25, 2019 — 10:30 AM EDT

**On the properties of Lambda-quantiles**

We present a systematic treatment of Lambda-quantiles, a family of generalized quantiles introduced in Frittelli et al. (2014) under the name of Lambda Value at Risk. We consider various possible definitions and derive their fundamental properties, mainly working under the assumption that the threshold function Lambda is nonincreasing. We refine some of the weak continuity results derived in Burzoni et al. (2017), showing that the weak continuity properties of Lambda-quantiles are essentially similar to those of the usual quantiles. Further, we provide an axiomatic foundation for Lambda-quantiles based on a locality property that generalizes a similar axiomatization of the usual quantiles based on the ordinal covariance property given in Chambers (2009). We study scoring functions consistent with Lambda-quantiles and as an extension of the usual quantile regression we introduce Lambda-quantile regression, of which we provide two financial applications.

(joint work with Ilaria Peri).

Friday, October 18, 2019 — 8:00 AM to Saturday, October 19, 2019 — 5:00 PM EDT

Thursday, October 17, 2019 — 4:00 PM EDT

**Building Deep Statistical Thinking for Data Science 2020: Privacy Protected Census, Gerrymandering, and Election**

The year 2020 will be a busy one for statisticians and more generally data scientists. The US Census Bureau has announced that the data from the 2020 Census will be released under differential privacy (DP) protection, which in layperson’s terms means adding some noise to the data. While few would argue against protecting data privacy, many researchers, especially from the social sciences, are concerned whether the right trade-offs between data privacy and data utility are being made. The DP protection also has a direct impact on redistricting, an issue that is already complicated enough with accurate counts, due to the need to guard against excessive gerrymandering. The central statistical problem there is a rather unique one: how to determine whether a realization is an outlier with respect to a null distribution, when that null distribution itself cannot be fully determined? The 2020 US election will be another highly watched event, with many groups already busy making predictions. Will the lessons from predicting the 2016 US election be learned, or will the failure be repeated? This talk invites the audience on a journey of deep statistical thinking prompted by these questions, regardless of whether they have any interest in the US Census or politics.
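
To make "adding some noise" concrete, here is a minimal sketch of the textbook Laplace mechanism for a single count. It is an illustration only and is not a description of the Census Bureau's actual disclosure avoidance system; the function name `noisy_count`, the seed, and the choice of `epsilon` are assumptions made here.

```python
import random


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return true_count plus Laplace noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(2020)  # fixed seed so the sketch is reproducible
releases = [noisy_count(100, epsilon=1.0, rng=rng) for _ in range(20000)]
mean_release = sum(releases) / len(releases)
var_release = sum((r - mean_release) ** 2 for r in releases) / len(releases)
```

Each released value differs from the true count of 100, yet the noise averages out to zero and its variance is about 2 × scale². A smaller `epsilon` means stronger privacy and noisier counts, which is the privacy-utility trade-off mentioned in the abstract.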

Tuesday, October 15, 2019 — 4:00 PM EDT

**Graphical Models and Structural Learning for Extremes**

Conditional independence, graphical models and sparsity are key notions for parsimonious models in high dimensions and for learning structural relationships in the data. The theory of multivariate and spatial extremes describes the risk of rare events through asymptotically justified limit models such as max-stable and multivariate Pareto distributions. Statistical modeling in this field has been limited to moderate dimensions so far, owing to complicated likelihoods and a lack of understanding of the underlying probabilistic structures.

We introduce a general theory of conditional independence for multivariate Pareto distributions that allows us to define graphical models and sparsity for extremes. New parametric models can be built in a modular way and statistical inference can be simplified to lower-dimensional margins. We define the extremal variogram, a new summary statistic that turns out to be a tree metric and therefore allows us to efficiently learn an underlying tree structure through Prim's algorithm. For a popular parametric class of multivariate Pareto distributions we show that, similarly to the Gaussian case, the sparsity pattern of a general graphical model can be easily read off from suitable inverse covariance matrices. This enables the definition of an extremal graphical lasso that enforces sparsity in the dependence structure. We illustrate the results with an application to flood risk assessment on the Danube river.

This is joint work with Adrien Hitz. Preprint available at https://arxiv.org/abs/1812.01734.
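
The tree-learning step mentioned in the abstract above can be sketched with a plain implementation of Prim's algorithm on a matrix of pairwise dissimilarities. The matrix `gamma` below is a made-up tree metric on a four-node path, standing in for an estimated extremal variogram; the function name `prim_mst` and the data are illustrative assumptions, not the authors' code.

```python
def prim_mst(weights):
    """Edges of a minimum spanning tree of a complete weighted graph.

    `weights` is a symmetric n x n list of lists of nonnegative dissimilarities.
    """
    n = len(weights)
    in_tree = [False] * n
    best_cost = [float("inf")] * n  # cheapest known link from the tree to each node
    best_from = [-1] * n            # tree endpoint of that cheapest link
    best_cost[0] = 0.0              # start growing the tree from node 0
    edges = []
    for _ in range(n):
        # Attach the node outside the tree that is cheapest to connect.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_cost[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        # Update the cheapest links using the newly added node.
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best_cost[v]:
                best_cost[v] = weights[u][v]
                best_from[v] = u
    return edges


# Distances that add up along the path 0 - 1 - 2 - 3 (edge lengths 1, 2, 1.5).
gamma = [
    [0.0, 1.0, 3.0, 4.5],
    [1.0, 0.0, 2.0, 3.5],
    [3.0, 2.0, 0.0, 1.5],
    [4.5, 3.5, 1.5, 0.0],
]
tree_edges = prim_mst(gamma)
```

Because the dissimilarities add up along the path, the minimum spanning tree recovers the edges (0, 1), (1, 2) and (2, 3) of the underlying tree.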

Friday, October 11, 2019 — 10:30 AM EDT

**Precision Factor Investing: Avoiding Factor Traps by Predicting Heterogeneous Effects of Firm Characteristics**

We apply ideas from causal inference and machine learning to estimate the sensitivity of future stock returns to observable characteristics like size, value, and momentum. By analogy with the informal notion of a "value trap," we distinguish "characteristic traps" (stocks with weak sensitivity) from "characteristic responders" (those with strong sensitivity). We classify stocks by interpreting these distinctions as heterogeneous treatment effects (HTE), with characteristics interpreted as treatments and future returns interpreted as responses. The classification exploits a large set of stock features and recent work applying machine learning to HTE. Long-short strategies based on sorting stocks on characteristics perform significantly better when applied to characteristic responders than traps. A strategy based on the difference between these long-short returns profits from the predictability of HTE rather than from factors associated with the characteristics themselves. This is joint work with Pu He.

Thursday, October 10, 2019 — 4:00 PM EDT

**Estimating Time-Varying Directed Networks**

The problem of modeling the dynamical regulation process within a gene network has been of great interest for a long time. We propose to model this dynamical system with a large number of nonlinear ordinary differential equations (ODEs), in which the regulation function is estimated directly from data without any parametric assumption. Most current research assumes the gene regulation network is static, but in reality, the connection and regulation function of the network may change with time or environment. This change is reflected in our dynamical model by allowing the regulation function to vary with the gene expression and forcing this regulation function to be zero if no regulation happens. We introduce a statistical method called functional SCAD to estimate a time-varying sparse and directed gene regulation network and, simultaneously, to provide a smooth estimate of the regulation function and identify the interval in which no regulation effect exists. The finite sample performance of the proposed method is investigated in a Monte Carlo simulation study. Our method is demonstrated by estimating a time-varying directed gene regulation network of 20 genes involved in muscle development during the embryonic stage of Drosophila melanogaster.

Thursday, October 3, 2019 — 4:00 PM EDT

**Real World EHR Big Data: Challenges and Opportunities**

Real-world EHR and health care Big Data may bring revolutionary thinking on how to evaluate therapeutic treatments and clinical pathways in a real-world setting. Big EHR data may also allow us to identify specific patient populations for a specific treatment so that the concept of personalized treatment can be implemented and deployed directly on the EHR system. However, it is quite challenging to use real-world data in treatment assessment and disease prediction for various reasons. In this talk, I will share our experiences with EHR and health care Big Data research. First, I will discuss the basic infrastructure and multi-disciplinary team that are necessary to deal with EHR data. Then I will use the example of a subarachnoid hemorrhage (SAH) study to demonstrate an eight-step procedure that we have developed to use EHR data for research purposes. In particular, EHR data extraction, cleaning, pre-processing and preparation are the major steps that call for novel statistical methods. Finally, I will discuss the challenges and opportunities for statisticians to use EHR data for research.

Thursday, September 26, 2019 — 4:15 PM EDT

**Optimal Transport, Entropy, and Risk Measures on Wiener space**

We discuss the interplay between entropy, large deviations, and optimal couplings on Wiener space.

In particular we prove a new rescaled version of Talagrand’s transport inequality. As an application, we consider rescaled versions of the entropic risk measure which are sensitive to risks in the fine structure of Brownian paths.

Thursday, September 19, 2019 — 4:00 PM EDT

**Simulation Optimization under Input Model Uncertainty**

Simulation optimization is concerned with identifying the best solution for large, complex and stochastic physical systems via computer simulation models. Its applications span various fields such as transportation, finance, power, and healthcare. A stochastic simulation model is driven by a set of distributions, known as the “input model”. However, since these distributions are usually estimated using finite real-world data, the simulation output is subject to the so-called “input model uncertainty”. Ignoring input uncertainty can cause a higher risk of selecting an inferior solution in simulation optimization. In this talk, I will first present a new framework called Bayesian Risk Optimization (BRO) that hedges against the risk of input uncertainty in simulation optimization. Then I will focus on the problem of optimizing over a finite solution space, a problem known as Ranking and Selection in the statistics literature, or Best-Arm Identification in the Multi-Armed Bandits literature, and present two new algorithms that can handle input uncertainty.

Friday, September 13, 2019 — 10:30 AM EDT

**Robust Distortion Risk Measures**

In the presence of uncertainty, robustness of risk measures, which are prominent tools for the assessment of financial risks, is of crucial importance. Distributional uncertainty may be accounted for by providing bounds on the values of a risk measure, so-called worst- and best-case risk measures. Worst (best)-case risk measures are determined as the maximal (minimal) value a risk measure can attain when the underlying distribution is unknown – typically up to its first moments. However, these bounds are too wide, and the (worst- and best-case) distributions that attain them too “unrealistic”, to be practically relevant.

We provide sharp bounds for the class of distortion risk measures with constraints on the first two moments combined with a constraint on the Wasserstein distance with respect to a reference distribution. Adding the Wasserstein distance constraint leads to significantly improved bounds and more “realistic” worst-case distributions. Specifically, the worst-case distributions of the two most widely used risk measures, the Value-at-Risk and the Tail-Value-at-Risk, depend on the reference distribution and thus are no longer two-point distributions.

This is a joint publication by Carole Bernard, Silvana M. Pesenti, and Steven Vanduffel.

Thursday, September 12, 2019 — 4:00 PM EDT

**Nonparametric failure time with Bayesian Additive Regression Trees**

Bayesian Additive Regression Trees (BART) is a nonparametric machine learning method for continuous, dichotomous, categorical and time-to-event outcomes. However, survival analysis with BART currently presents some challenges. Two current approaches each have their pros and cons. Our discrete time approach is free of precarious restrictive assumptions such as proportional hazards and Accelerated Failure Time (AFT), but it becomes increasingly computationally demanding as the sample size increases. Alternatively, a Dirichlet Process Mixture approach is computationally friendly, but it suffers from the AFT assumption. Therefore, we propose to further nonparametrically enhance this latter approach via heteroskedastic BART which will remove the restrictive AFT assumption while maintaining its desirable computational properties.

Thursday, August 22, 2019 — 4:00 PM EDT

**Development and Application of A Measure of Prediction Accuracy for Binary and Censored Time to Event Data**

Clinical preventive care often uses risk scores to screen the population for high-risk patients for targeted intervention. Typically the prevalence is low, meaning extremely unbalanced classes. Positive predictive value and true positive fraction have been recognized as relevant metrics in this imbalanced setting. However, for commonly used continuous or ordinal risk scores, these measures require a subjective cut-off threshold value to dichotomize and predict class membership. In this talk, I describe a summary index of positive predictive value (AP) for binary and event time outcome data. Similar to the widely used AUC, AP is rank based and a semi-proper scoring rule. We also study the behavior of incremental values of AUC, AP and the strictly proper scoring rule scaled Brier score (sBrier) when an additional risk factor Z is included. It is shown that the agreement between the incremental values of AP and sBrier increases as the class imbalance increases, while the agreement between AUC and sBrier decreases as the class imbalance increases. Under certain configurations, the changes in AP and sBrier indicate worse prediction performance when Z is added to the risk profile, while the changes in AUC almost always favor the addition of Z. Several real-world examples are used throughout the talk to illustrate and contrast these metrics.
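
The contrast between the two rank-based summaries for binary outcomes can be sketched as below. This uses the standard definitions of AUC and average precision for binary labels, which is an assumption about how the talk defines AP; the event-time version is not covered, and the function names and toy scores are illustrative.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the positive predictive value evaluated at the rank of each positive.

    Tied scores are ordered arbitrarily in this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` subjects
    return total / hits


# Ten subjects sorted by risk score; only two are cases (ranks 2 and 5).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
labels[4] = 1

auc_value = auc(scores, labels)
ap_value = average_precision(scores, labels)
```

Here AUC is 0.75 while AP is 0.45: with few cases, a handful of highly ranked non-cases barely moves AUC but lowers the positive predictive value among top-ranked subjects substantially.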

Tuesday, August 13, 2019 — 4:00 PM EDT

**Spatial Cauchy processes with local tail dependence**

We study a class of models for spatial data obtained using Cauchy convolution processes with random indicator kernel functions. We show that the resulting spatial processes have some appealing dependence properties including tail dependence at smaller distances and asymptotic independence at larger distances. We derive extreme-value limits of these processes and consider some interesting special cases. We show that estimation is feasible in high dimensions and the proposed class of models allows for a wide range of dependence structures.