Thursday, January 30, 2020 — 10:00 AM EST

**Batch-mode active learning for regression and its application to the valuation of large variable annuity portfolios**

Supervised learning algorithms require a sufficient amount of labeled data to construct an accurate predictive model. In practice, collecting labeled data may be extremely time-consuming, while unlabeled data can be easily accessed. When labeled data are insufficient for a prediction model to perform well and the budget for additional data collection is limited, it is important to select the objects to be labeled according to how much they improve the model's performance. In this talk, I will focus on active learning, which aims to train an accurate prediction model with minimum labeling cost. In particular, I will present batch-mode active learning for regression problems. Based on random forest, I will propose two effective random sampling algorithms that consider the prediction ambiguities and diversities of unlabeled objects as measures of their informativeness. Empirical results on an insurance data set demonstrate the effectiveness of the proposed approaches in valuing large variable annuity portfolios (a practical problem in the actuarial field). Additionally, comparisons with the existing framework that relies on a sequential combination of unsupervised and supervised learning algorithms are presented.

Wednesday, January 29, 2020 — 10:00 AM EST

#### Marginal analysis of multiple outcomes with informative cluster size

Periodontal disease is a serious infection of the gums and the bones surrounding the teeth. In the Veterans Affairs Dental Longitudinal Study (VADLS), the relationships between periodontal disease and other health and socioeconomic conditions are of interest. To determine whether or not a patient has periodontal disease, multiple clinical measurements (clinical attachment loss, alveolar bone loss, tooth mobility) are taken at the tooth level. However, a universal definition for periodontal disease does not exist, and researchers often create a composite outcome from these measurements or analyze each outcome separately. Moreover, patients have varying numbers of teeth, with those who are more prone to the disease having fewer teeth compared to those with good oral health. Such dependence between the outcome of interest and cluster size (number of teeth) is called informative cluster size, and results obtained from fitting conventional marginal models can be biased. In this talk, I will introduce a novel method to jointly analyze multiple correlated outcomes for clustered data with informative cluster size using the class of generalized estimating equations (GEE) with cluster-specific weights. Using the data from VADLS, I will compare the results obtained from the proposed multivariate outcome cluster-weighted GEE to those from the conventional unweighted GEE. Finally, I will discuss a few other research settings where data may exhibit informative cluster size.
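As a rough illustration of the weighting idea only (not the multivariate method from the talk), the sketch below contrasts an unweighted tooth-level mean with an inverse-cluster-size-weighted mean for a single binary outcome. The toy data and function names are hypothetical. It assumes the simplest case, an intercept-only marginal model with an independence working correlation, where the cluster-weighted estimate reduces to the average of the per-patient means.

```python
# Hypothetical toy data: one list of tooth-level disease indicators per patient.
# Patients with more disease have fewer teeth (informative cluster size).
patients = [
    [1, 1],                    # 2 teeth, both diseased
    [1, 0, 0, 0],              # 4 teeth, one diseased
    [0, 0, 0, 0, 0, 0, 0, 1],  # 8 teeth, one diseased
]

def unweighted_mean(clusters):
    """Every tooth counts equally, so patients with many teeth dominate."""
    total = sum(sum(c) for c in clusters)
    count = sum(len(c) for c in clusters)
    return total / count

def cluster_weighted_mean(clusters):
    """Each tooth gets weight 1 / (cluster size), so every patient counts equally."""
    weighted_sum = sum(y / len(c) for c in clusters for y in c)
    return weighted_sum / len(clusters)

print(unweighted_mean(patients))
print(cluster_weighted_mean(patients))
```

In this toy example the unweighted estimate is pulled toward the healthier patients, who contribute more teeth, which is the kind of bias the abstract attributes to conventional marginal models.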

Monday, January 27, 2020 — 10:00 AM EST

#### Network Analysis of the Brain: from Generative Modeling to Multilayer Network Embedding of Functional Connectivity Data

Recent large-scale projects in neuroscience, such as the Human Connectome Project and the BRAIN initiative, emphasize the need for new statistical and computational techniques for analyzing functional connectivity within and across populations. Network-based models have greatly improved our understanding of brain structure and function, yet many important challenges remain. In this talk, I will consider two particularly important challenges: i) how does one characterize the generative mechanisms of functional connectivity, and ii) how does one identify discriminatory features among connectivity scans over disparate populations? To address the first challenge, I propose and describe a generative network model, called the correlation generalized exponential random graph model (cGERGM), that flexibly characterizes the joint network topology of correlation networks arising in functional connectivity. The model is the first of its kind to directly assess the network structure of a correlation network while simultaneously handling the mathematical constraints of a correlation matrix. I apply the cGERGM to resting state fMRI data from healthy individuals in the Human Connectome Project. The cGERGM reveals remarkably consistent organizational properties guiding subnetwork architecture, suggesting a fundamental organizational basis for subnetwork communication that differs from previous beliefs.

For the second challenge, I focus on learning interpretable features from complex multilayer networks arising in population studies of functional connectivity. I will introduce the multi-node2vec algorithm, an efficient and scalable feature engineering method that learns continuous node feature representations from multilayer networks. The multi-node2vec algorithm identifies maximum likelihood estimators of nodal features through the use of the Skip-gram neural network model. Asymptotic analysis of the algorithm reveals that it is a fast approximation to a multi-dimensional non-negative matrix factorization applied to a weighted average of the layers in the multilayer network. I apply multi-node2vec to a multilayer functional brain network from resting state fMRI scans over a population of 74 healthy individuals and 70 patients with varying degrees of schizophrenia. The identified functional embeddings closely associate with the functional organization of the brain and offer important insights into the differences between patient and healthy groups that are well supported by theory.

Friday, January 24, 2020 — 10:00 AM EST

**Clustering and Classification of Three-Way Data**

Clustering and classification is the process of finding and analyzing underlying group structure in heterogeneous data and is fundamental to computational statistics and machine learning. In the past, relatively simple techniques could be used for clustering; however, with data becoming increasingly complex, these methods are oftentimes not advisable, and in some cases not possible. One such example is the analysis of three-way data, where each data point is represented as a matrix instead of a traditional vector. Examples of three-way data include greyscale images and multivariate longitudinal data. In this talk, recent methods for clustering three-way data will be presented, including methods for high-dimensional and skewed three-way data. Both simulated and real data will be used for illustration, and future directions and extensions will be discussed.

Wednesday, January 22, 2020 — 10:00 AM EST

**The possibility of nearly assumption-free inference in causal inference**

In causal effect estimation, the state of the art is the class of so-called double machine learning (DML) estimators, which combine the benefits of doubly robust estimation, sample splitting, and machine learning methods for estimating nuisance parameters. The validity of the confidence interval associated with a DML estimator relies, for the most part, on the complexity of the nuisance parameters and how close the machine learning estimators are to the nuisance parameters. Until we have a complete understanding of the theory of many machine learning methods, including deep neural networks, even a DML estimator may have a bias so large that it prohibits valid inference. In this talk, we describe a nearly assumption-free procedure that can either criticize the invalidity of the Wald confidence interval associated with the DML estimators of some causal effect of interest or falsify the certificates (i.e. the mathematical conditions) that, if true, could ensure valid inference. Essentially, we are testing the null hypothesis that the bias of an estimator is smaller than a fraction $\rho$ of its standard error. Our test is valid under the null without requiring any complexity (smoothness or sparsity) assumptions on the nuisance parameters or the properties of machine learning estimators, and may have power to inform analysts that they have to use something other than DML estimators or Wald confidence intervals for inference purposes. This talk is based on joint work with Rajarshi Mukherjee and James M. Robins.
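One way to write the null hypothesis described above is sketched below. The symbols $\hat\psi$ (the DML estimator), $\mathrm{Bias}$ and $\mathrm{s.e.}$ are notation introduced here for illustration, and the use of the absolute bias is an assumption rather than something stated in the abstract.

$$
H_0:\ \bigl|\mathrm{Bias}(\hat\psi)\bigr| < \rho \cdot \mathrm{s.e.}(\hat\psi)
$$

Under this reading, a rejection suggests that the bias is not negligible relative to the standard error, so the Wald interval $\hat\psi \pm z_{1-\alpha/2}\,\mathrm{s.e.}(\hat\psi)$ may not have its nominal coverage.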

Tuesday, January 21, 2020 — 10:00 AM EST

**Diagnostics for Regression Models with Discrete Outcomes**

Making informed decisions about model adequacy has been an outstanding issue for regression models with discrete outcomes. Standard residuals such as Pearson and deviance residuals for such outcomes often show a large discrepancy from the hypothesized pattern even under the true model and are not informative especially when data are highly discrete. To fill this gap, we propose a surrogate empirical residual distribution function for general discrete (e.g. ordinal and count) outcomes that serves as an alternative to the empirical Cox-Snell residual distribution function. When at least one continuous covariate is available, we show asymptotically that the proposed function converges uniformly to the identity function under the correctly specified model, even with highly discrete (e.g. binary) outcomes. Through simulation studies, we demonstrate empirically that the proposed surrogate empirical residual distribution function is highly effective for various diagnostic tasks, since it is close to the hypothesized pattern under the true model and significantly departs from this pattern under model misspecification.

Monday, January 20, 2020 — 10:00 AM EST

**Sufficient Dimension Reduction for Populations with Structured Heterogeneity**

Risk modeling has become a crucial component in the effective delivery of health care. A key challenge in building effective risk models is accounting for patient heterogeneity among the diverse populations present in health systems. Incorporating heterogeneity based on the presence of various comorbidities into risk models is crucial for the development of tailored care strategies, as it can provide patient-centered information and can result in more accurate risk prediction. Yet, in the presence of high dimensional covariates, accounting for this type of heterogeneity can exacerbate estimation difficulties even with large sample sizes. Towards this aim, we propose a flexible and interpretable risk modeling approach based on semiparametric sufficient dimension reduction. The approach accounts for patient heterogeneity, borrows strength in estimation across related subpopulations to improve both estimation efficiency and interpretability, and can serve as a useful exploratory tool or as a powerful predictive model. In simulated examples, we show that our approach can improve estimation performance in the presence of heterogeneity and is quite robust to deviations from its key underlying assumption. We demonstrate the utility of our approach in the prediction of hospital admission risk for a large health system when tested on further follow-up data.

Thursday, January 16, 2020 — 10:00 AM EST

**Adapting black-box machine learning methods for causal inference**

I'll cover two recent works on the use of deep learning for causal inference with observational data. The setup for the problem is: we have an observational dataset where each observation includes a treatment, an outcome, and covariates (confounders) that may affect the treatment and outcome. We want to estimate the causal effect of the treatment on the outcome; that is, what happens if we intervene? This effect is estimated by adjusting for the covariates. The talk covers two aspects of using deep learning for this adjustment.

First, neural network research has focused on *predictive* performance, but our goal is to produce a quality *estimate* of the effect. I'll describe two adaptations to neural net design and training, based on insights from the statistical literature on the estimation of treatment effects. The first is a new architecture, the Dragonnet, that exploits the sufficiency of the propensity score for estimation adjustment. The second is a regularization procedure, targeted regularization, that induces a bias towards estimates that have non-parametrically optimal asymptotic properties.

Second, I'll describe how to use deep language models (e.g., BERT) for causal inference with text data. The challenge here is that text data is high dimensional, and naive dimension reduction may throw away information required for causal identification. The main insight is that the text representation produced by deep embedding methods suffices for the causal adjustment.

Tuesday, January 14, 2020 — 10:00 AM EST

**Statistical Inference for Multi-View Clustering**

In the multi-view data setting, multiple data sets are collected on a single, common set of observations. For example, we might perform genomic and proteomic assays on a single set of tumour samples, or we might collect relationship data from two online social networks for a single set of users. It is tempting to cluster the observations using all of the data views, in order to fully exploit the available information. However, clustering the observations using all of the data views implicitly assumes that a single underlying clustering of the observations is shared across all data views. If this assumption does not hold, then clustering the observations using all data views may lead to spurious results. We seek to evaluate the assumption that there is some underlying relationship among the clusterings from the different data views, by asking the question: are the clusters within each data view dependent or independent? We develop new tests for answering this question based on multivariate and/or network data views, and apply them to multi-omics data from the Pioneer 100 Wellness Study (Price and others, 2017) and protein-protein interaction data from the HINT database (Das and Yu, 2012). We will also briefly discuss our current work on testing for no difference between the means of two estimated clusters in a single-view data set. This is joint work with Jacob Bien (University of Southern California) and Daniela Witten (University of Washington).

Monday, January 13, 2020 — 10:00 AM EST

**Sampling 'hard-to-reach' populations: recent developments**

In this talk, I will present some recent methodological developments in capture-recapture methods and Respondent-Driven Sampling (RDS).

In capture-recapture methods, our work is concerned with the analysis of marketing data on the activation of applications (apps) on mobile devices. Each application has a hashed identification number that is specific to the device on which it has been installed. This number can be registered by a platform at each activation of the application. Activations on the same device are linked together using the identification number. By focusing on activations that took place at a business location, one can create a capture-recapture data set about devices, or more specifically their users, that "visited" the business: the units are owners of mobile devices and the capture occasions are time intervals such as days. We propose a new algorithm for estimating the parameters of a robust design with a fairly large number of capture occasions, together with a simple parametric bootstrap variance estimator.

RDS is a variant of link-tracing, a sampling technique for surveying hard-to-reach communities that takes advantage of community members' social networks to reach potential participants. While the RDS sampling mechanism and associated methods of adjusting for the sampling at the analysis stage are well-documented in the statistical sciences literature, methodological focus has largely been restricted to estimation of population means and proportions (e.g. prevalence). As a network-based sampling method, RDS is faced with the fundamental problem of sampling from population networks where features such as homophily and differential activity (two measures of tendency for individuals with similar traits to share social links) are sensitive to the choice of a simulation and sampling method. In this work, *(i)* we present strategies for simulating RDS samples with known network and sample characteristics, so as to provide a foundation from which to expand the study of RDS analyses beyond the univariate framework and *(ii)* embed RDS within a causal inference framework and determine conditions under which average causal effects can be estimated. The proposed methodology will constitute a unifying approach that deals with simple estimands (means and proportions), with a natural extension to the study of associational and causal questions.

Friday, January 10, 2020 — 10:00 AM EST

#### Global and local estimation of low-rank random graphs

Random graph models have been a hot topic in statistics and machine learning, as well as in a broad range of application areas. In this talk I will give two perspectives on the task of estimating low-rank random graphs. Specifically, I will focus on estimating the latent positions in random dot product graphs. The first component of the talk focuses on the global estimation task. The minimax lower bound for global estimation of the latent positions is established, and this minimax lower bound is achieved by a Bayes procedure, referred to as the posterior spectral embedding. The second component of the talk addresses the local estimation task. We define local efficiency in estimating each individual latent position, propose a novel one-step estimator that takes advantage of the curvature information of the likelihood function (i.e., derivative information) of the graph model, and show that this estimator is locally efficient. The previously widely adopted adjacency spectral embedding is proven to be locally inefficient because it ignores the curvature information of the likelihood function. Simulation examples and the analysis of a real-world Wikipedia graph dataset are provided to demonstrate the usefulness of the proposed methods.
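For readers unfamiliar with the baseline mentioned above, here is a minimal sketch of the adjacency spectral embedding: scale the top-d eigenvectors of the adjacency matrix by the square roots of the corresponding eigenvalue magnitudes. The function name and the toy latent positions are hypothetical, and the example embeds the noise-free edge-probability matrix rather than a sampled graph so that the check is deterministic. The one-step estimator and posterior spectral embedding from the talk are not shown.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Embed a symmetric matrix A into d dimensions using its top-d eigenpairs."""
    eigvals, eigvecs = np.linalg.eigh(A)           # eigenvalues in ascending order
    top = np.argsort(np.abs(eigvals))[::-1][:d]    # indices of the d largest in magnitude
    return eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))

# Hypothetical latent positions for 4 nodes in 2 dimensions.
X = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.2, 0.7],
              [0.1, 0.8]])
P = X @ X.T  # edge-probability matrix of the random dot product graph

# Embedding P itself (the noise-free case) recovers X only up to an orthogonal
# rotation, so we compare Gram matrices rather than the positions directly.
X_hat = adjacency_spectral_embedding(P, d=2)
print(np.allclose(X_hat @ X_hat.T, P))
```

In practice the input would be a binary adjacency matrix sampled with edge probabilities given by `P`, and the embedding would then only approximate the latent positions.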

Thursday, January 9, 2020 — 10:00 AM EST

#### Methodological Problems in Multigenerational Epidemiology

While epidemiology has typically focused on how exposures impact the individuals directly exposed, recent interest has been shown in investigating exposures with multigenerational effects, that is, ones that affect the children and grandchildren of those directly exposed. For example, a recent motivating study examined the association between maternal in-utero diethylstilbestrol exposure and ADHD in the Nurses' Health Study II. Such multigenerational studies, however, are susceptible to informative cluster size, which occurs when the number of children a mother has (the cluster size) is related to their outcomes. But what if some women have no children at all? We first consider this problem of informatively empty clusters. Second, observing populations across multiple generations can be prohibitively expensive, so multigenerational studies often measure exposures retrospectively and hence are susceptible to misclassification and recall bias. We thus study the impact of exposure misclassification when cluster size is potentially informative, as well as when misclassification is differential by cluster size. Finally, outside the relative control of laboratory settings, population-based multigenerational studies have had to entertain a broad range of study designs. We show that these designs have important implications for the scope of scientific inquiry, and we highlight areas in need of further methodological research.

Tuesday, January 7, 2020 — 10:00 PM EST

#### Renewable Estimation and Incremental Inference in Streaming Data Analysis

New data collection and storage technologies have given rise to a new field of streaming data analytics, including real-time statistical methodology for online data analyses. Streaming data refers to high-throughput recordings with large volumes of observations gathered sequentially and perpetually over time. Examples include data from national disease registries, mobile health, and disease surveillance, among others. This talk primarily concerns the development of a fast real-time statistical estimation and inference method for regression analysis, with a particular objective of addressing challenges in streaming data storage and computational efficiency. Termed renewable estimation, this method enjoys strong theoretical guarantees, including both asymptotic unbiasedness and estimation efficiency, as well as fast computational speed. The key technical novelty is that the proposed method uses only the current data and summary statistics of the historical data. The proposed algorithm will be demonstrated in generalized linear models (GLM) for cross-sectional data. I will discuss both the conceptual understanding and the theoretical guarantees of the method and illustrate its performance via numerical examples. This is joint work with my supervisor Professor Peter Song.
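The idea of combining current data with summary statistics of historical data can be illustrated in the simplest special case, a simple linear regression, where a handful of running sums reproduce the full-data least-squares fit exactly. This is only a sketch of the general principle: the class name and toy stream are hypothetical, and the GLM method described in the talk is more involved and is not reproduced here.

```python
class StreamingLeastSquares:
    """Simple linear regression y = a + b*x, updated one batch at a time.

    Only five running sums are stored; raw historical batches are discarded.
    """

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, xs, ys):
        """Fold a new batch into the summary statistics and return (a, b)."""
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y
        return self.coefficients()

    def coefficients(self):
        """Closed-form least-squares intercept and slope from the running sums."""
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope

# Hypothetical stream of two batches lying exactly on y = 1 + 2x.
model = StreamingLeastSquares()
model.update([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
intercept, slope = model.update([3.0, 4.0], [7.0, 9.0])
print(intercept, slope)
```

Storage stays constant no matter how many batches arrive, which is the storage and computation benefit the abstract highlights. For nonlinear GLMs an exact update of this kind is generally not available, which is where the talk's method comes in.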

Monday, January 6, 2020 — 10:00 AM EST

#### Navigation and Evaluation of Latent Structure in High-Dimensional Data

In the modern data analysis paradigm, fitting models is easy, but knowing how to design or evaluate them is difficult. In this talk, we will adapt insights from graphical statistics and goodness-of-fit testing to modern problems, illustrating them with applications to microbiome genomics and climate systems science.

For the microbiome, we show how linking complementary displays can make it easy to query structure in raw data. We also find novel visual summaries that inform model criticism more deeply than data splitting strategies alone. We then describe how artificial intelligence can be used to accelerate climate simulations, and introduce techniques for characterizing goodness-of-fit of the resulting models.

Viewed broadly, these projects provide opportunities for human interaction in the automated data processing regime, facilitating (1) streamlined navigation of data and (2) critical evaluation of models.