Friday, June 5, 2020 — 10:30 AM EDT

TBA

Friday, May 15, 2020 — 10:30 AM EDT

TBA

Friday, May 1, 2020 — 3:00 PM to Sunday, May 3, 2020 — 6:00 PM EDT

Friday, May 1, 2020 — 10:30 AM EDT

TBA

Friday, April 17, 2020 — 10:30 AM EDT

TBA

Friday, April 3, 2020 — 10:30 AM EDT

TBA

Friday, March 20, 2020 — 10:30 AM EDT

TBA

Friday, March 6, 2020 — 10:30 AM EST

TBA

Friday, February 21, 2020 — 10:30 AM EST

TBA

Friday, February 7, 2020 — 10:00 AM EST

**The Extended Reproducibility Phenotype - Re-framing and Generalizing Computational Reproducibility**

Computational reproducibility has become a crucial part of how data analytic results are understood and assessed both in and outside of academia. Less work, however, has explored whether these strict computational reproducibility criteria are necessary or sufficient to actually meet our needs as consumers of analysis results. I will show that in principle they are neither. I will present two inter-related veins of work. First, I will provide a conceptual reframing of the concepts of strict reproducibility, and the actions analysts take to ensure it, in terms of our ability to actually trust the results and the claims about the underlying data-generating systems they embody. Second, I will present a generalized conception of reproducibility by introducing the concepts of Currency, Comparability and Completeness and their oft-overlooked importance to assessing data analysis results.

Thursday, February 6, 2020 — 10:00 AM EST

**Censoring Unbiased Regression Trees and Ensembles**

Tree-based methods are useful tools to identify risk groups and conduct prediction by employing recursive partitioning to separate subjects into different risk groups. We propose a novel paradigm of building regression trees for censored data in survival analysis. We prudently construct the censored-data loss function through an extension of the theory of censoring unbiased transformations. With this construction, we can conveniently implement the proposed regression tree algorithm using existing software for the Classification and Regression Trees algorithm (e.g., the rpart package in R) and extend it for ensemble learning. Simulations and real data examples demonstrate that our methods either improve upon or remain competitive with existing tree-based algorithms for censored data.

Wednesday, February 5, 2020 — 10:00 AM EST

**Detecting the Signal Among Noise and Contamination in High Dimensions**

Improvements in biomedical technology and a surge in other data-driven sciences lead to the collection of increasingly large amounts of data. In this affluence of data, contamination is ubiquitous but often neglected, creating substantial risk of spurious scientific discoveries. Especially in applications with high-dimensional data, for instance proteomic biomarker discovery, the impact of contamination on methods for variable selection and estimation can be profound yet difficult to diagnose.

In this talk I present a method for variable selection and estimation in high-dimensional linear regression models, leveraging the elastic-net penalty for complex data structures. The method is capable of harnessing the collected information even in the presence of arbitrary contamination in the response and the predictors. I showcase the method’s theoretical and practical advantages, specifically in applications with heavy-tailed errors and limited control over the data. I outline efficient algorithms to tackle computational challenges posed by inherently non-convex objective functions of robust estimators and practical strategies for hyper-parameter selection, ensuring scalability of the method and applicability to a wide range of problems.

Tuesday, February 4, 2020 — 10:00 AM EST

**Bayesian Utility-Based Toxicity Probability Interval Design for Dose Finding in Phase I/II Trials**

Molecularly targeted agents and immunotherapy have revolutionized modern cancer treatment. Unlike with chemotherapy, the maximum tolerated dose of a targeted therapy may not offer a significant clinical benefit over lower doses. By simultaneously considering both binary toxicity and efficacy endpoints, phase I/II trials can identify a better dose for subsequent phase II trials than traditional phase I trials in terms of the efficacy-toxicity tradeoff. Existing phase I/II dose-finding methods are model-based or need to pre-specify many design parameters, which makes them difficult to implement in practice. To strengthen and simplify the current practice of phase I/II trials, we propose a utility-based toxicity probability interval (uTPI) design for finding the optimal biological dose (OBD) where binary toxicity and efficacy endpoints are observed. The uTPI design is model-assisted in nature, simply modeling the utility outcomes observed at the current dose level based on a quasibinomial likelihood. Toxicity probability intervals are used to screen out overly toxic dose levels, and then the dose escalation/de-escalation decisions are made adaptively by comparing the posterior utility distributions of the adjacent levels of the current dose. The uTPI design is flexible in accommodating various utility functions while requiring only a minimal number of design parameters. A prominent feature of the uTPI design is its simple decision structure: a concise dose-assignment decision table can be calculated before the start of the trial and used throughout it, which greatly simplifies practical implementation of the design. Extensive simulation studies demonstrate that the proposed uTPI design yields desirable and robust performance under various scenarios. This talk is based on joint work with Ruitao Lin and Ying Yuan at MD Anderson Cancer Center.

Thursday, January 30, 2020 — 10:00 AM EST

**Batch-mode active learning for regression and its application to the valuation of large variable annuity portfolios**

Supervised learning algorithms require a sufficient amount of labeled data to construct an accurate predictive model. In practice, collecting labeled data may be extremely time-consuming while unlabeled data can be easily accessed. In a situation where labeled data are insufficient for a prediction model to perform well and the budget for additional data collection is limited, it is important to select the objects to be labeled according to how much they are expected to improve the model's performance. In this talk, I will focus on the idea of active learning, which aims to train an accurate prediction model with minimum labeling cost. In particular, I will present batch-mode active learning for regression problems. Based on random forest, I will propose two effective random sampling algorithms that consider the prediction ambiguities and diversities of unlabeled objects as measures of their informativeness. Empirical results on an insurance data set demonstrate the effectiveness of the proposed approaches in valuing large variable annuity portfolios, a practical problem in the actuarial field. I will also compare the proposed approaches with the existing framework that relies on a sequential combination of unsupervised and supervised learning algorithms.
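As a rough illustration of the ambiguity idea in the abstract above, the sketch below scores each unlabeled object by the variance of its per-tree predictions and then draws a batch at random with probability proportional to that score. It is a simplified stand-in, not the speaker's algorithms: the function names, the toy predictions, and the sampling scheme are all assumptions, and the diversity criterion mentioned in the abstract is not modeled.

```python
import random
from statistics import pvariance


def ambiguity(tree_predictions):
    """Variance of the per-tree predictions for each unlabeled object."""
    return [pvariance(preds) for preds in tree_predictions]


def sample_batch(scores, batch_size, seed=0):
    """Draw distinct indices, one at a time, with probability proportional to score.

    Hypothetical sampling scheme for illustration only.
    """
    rng = random.Random(seed)
    remaining = list(range(len(scores)))
    batch = []
    for _ in range(min(batch_size, len(remaining))):
        # A tiny constant keeps the weights valid when every score is zero.
        weights = [scores[i] + 1e-12 for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        batch.append(pick)
        remaining.remove(pick)
    return batch


# Toy per-tree predictions for four unlabeled objects (made-up numbers).
tree_preds = [
    [10.0, 10.1, 9.9],  # trees agree closely: low ambiguity
    [5.0, 9.0, 1.0],    # trees disagree strongly: high ambiguity
    [7.0, 7.0, 7.0],    # trees agree exactly: zero ambiguity
    [2.0, 6.0, 4.0],
]
scores = ambiguity(tree_preds)
batch = sample_batch(scores, batch_size=2)
```

Objects on which the trees disagree are more likely to be sent for labeling, while the random draw keeps the batch from being fully determined by the current model.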

Wednesday, January 29, 2020 — 10:00 AM EST

**Marginal analysis of multiple outcomes with informative cluster size**

Periodontal disease is a serious infection of the gums and the bones surrounding the teeth. In the Veterans Affairs Dental Longitudinal Study (VADLS), the relationships between periodontal disease and other health and socioeconomic conditions are of interest. To determine whether or not a patient has periodontal disease, multiple clinical measurements (clinical attachment loss, alveolar bone loss, tooth mobility) are taken at the tooth level. However, a universal definition of periodontal disease does not exist, and researchers often create a composite outcome from these measurements or analyze each outcome separately. Moreover, patients have varying numbers of teeth, with those who are more prone to the disease having fewer teeth than those with good oral health. Such dependence between the outcome of interest and cluster size (number of teeth) is called informative cluster size, and results obtained from fitting conventional marginal models can be biased. In this talk, I will introduce a novel method to jointly analyze multiple correlated outcomes for clustered data with informative cluster size using the class of generalized estimating equations (GEE) with cluster-specific weights. Using the data from VADLS, I will compare the results obtained from the proposed multivariate outcome cluster-weighted GEE to those from the conventional unweighted GEE. Finally, I will discuss a few other research settings where data may exhibit informative cluster size.
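The effect of informative cluster size can be seen in the simplest possible case: an intercept-only marginal model with an independence working correlation. Weighting each observation by the inverse of its cluster size then reduces to averaging the per-patient means, whereas the unweighted estimate pools all teeth. The sketch below uses made-up data and hypothetical function names; it shows the weighting idea only, not the multivariate method presented in the talk.

```python
def unweighted_mean(clusters):
    """Pool every observation, so patients with many teeth dominate."""
    values = [y for cluster in clusters for y in cluster]
    return sum(values) / len(values)


def cluster_weighted_mean(clusters):
    """Weight each observation by 1 / cluster size.

    This solves sum_i (1/n_i) * sum_j (y_ij - mu) = 0, which is the
    mean of the per-cluster means, so every patient counts equally.
    """
    nonempty = [c for c in clusters if c]
    return sum(sum(c) / len(c) for c in nonempty) / len(nonempty)


# Toy data (made up): one list per patient, one entry per remaining tooth,
# 1 = diseased tooth. Patients with more disease have fewer teeth left.
patients = [
    [1, 1],
    [1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

pooled = unweighted_mean(patients)
weighted = cluster_weighted_mean(patients)
```

Here the pooled estimate is 5/23, while the cluster-weighted estimate is roughly twice as large, because patients with few remaining teeth are no longer swamped by patients with many healthy ones.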

Monday, January 27, 2020 — 10:00 AM EST

**Network Analysis of the Brain: from Generative Modeling to Multilayer Network Embedding of Functional Connectivity Data**

Recent large-scale projects in neuroscience, such as the Human Connectome Project and the BRAIN initiative, emphasize the need for new statistical and computational techniques for analyzing functional connectivity within and across populations. Network-based models have greatly improved our understanding of brain structure and function, yet many important challenges remain. In this talk, I will consider two particularly important challenges: i) how does one characterize the generative mechanisms of functional connectivity, and ii) how does one identify discriminatory features among connectivity scans over disparate populations? To address the first challenge, I propose and describe a generative network model, called the correlation generalized exponential random graph model (cGERGM), that flexibly characterizes the joint network topology of correlation networks arising in functional connectivity. The model is the first of its kind to directly assess the network structure of a correlation network while simultaneously handling the mathematical constraints of a correlation matrix. I apply the cGERGM to resting state fMRI data from healthy individuals in the Human Connectome Project. The cGERGM reveals remarkably consistent organizational properties guiding subnetwork architecture, suggesting a fundamental organizational basis for subnetwork communication that differs from previous beliefs.

For the second challenge, I focus on learning interpretable features from complex multilayer networks arising in population studies of functional connectivity. I will introduce the multi-node2vec algorithm, an efficient and scalable feature engineering method that learns continuous node feature representations from multilayer networks. The multi-node2vec algorithm identifies maximum likelihood estimators of nodal features through the use of the Skip-gram neural network model. Asymptotic analysis of the algorithm reveals that it is a fast approximation to a multi-dimensional non-negative matrix factorization applied to a weighted average of the layers in the multilayer network. I apply multi-node2vec to a multilayer functional brain network from resting state fMRI scans over a population of 74 healthy individuals and 70 patients with varying degrees of schizophrenia. The identified functional embeddings closely associate with the functional organization of the brain and offer important insights into the differences between patient and healthy groups that are well supported by theory.

Friday, January 24, 2020 — 10:00 AM EST

**Clustering and Classification of Three-Way Data**

Clustering and classification is the process of finding and analyzing underlying group structure in heterogeneous data and is fundamental to computational statistics and machine learning. In the past, relatively simple techniques could be used for clustering; however, with data becoming increasingly complex, these methods are oftentimes not advisable, and in some cases not possible. One such example is the analysis of three-way data, where each data point is represented as a matrix instead of a traditional vector. Examples of three-way data include greyscale images and multivariate longitudinal data. In this talk, recent methods for clustering three-way data will be presented, including methods for high-dimensional and skewed three-way data. Both simulated and real data will be used for illustration, and future directions and extensions will be discussed.

Wednesday, January 22, 2020 — 10:00 AM EST

**The possibility of nearly assumption-free inference in causal inference**

In causal effect estimation, the state of the art is the so-called double machine learning (DML) estimators, which combine the benefits of doubly robust estimation, sample splitting, and the use of machine learning methods to estimate nuisance parameters. The validity of the confidence interval associated with a DML estimator relies, for the most part, on the complexity of the nuisance parameters and on how close the machine learning estimators are to them. Until we have a complete understanding of the theory of many machine learning methods, including deep neural networks, even a DML estimator may have a bias so large that it prohibits valid inference. In this talk, we describe a nearly assumption-free procedure that can either detect the invalidity of the Wald confidence interval associated with the DML estimators of some causal effect of interest or falsify the certificates (i.e. the mathematical conditions) that, if true, could ensure valid inference. Essentially, we are testing the null hypothesis that the bias of an estimator is smaller than a fraction $\rho$ of its standard error. Our test is valid under the null without requiring any complexity (smoothness or sparsity) assumptions on the nuisance parameters or on the properties of the machine learning estimators, and it may have power to inform analysts that they need something other than DML estimators or Wald confidence intervals for inference. This talk is based on joint work with Rajarshi Mukherjee and James M. Robins.
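One way to write the null hypothesis described in this abstract is given below. The notation is an assumption, not taken from the talk: $\hat\psi$ denotes the estimator, and the absolute value and the non-strict inequality are one plausible reading of "smaller than a fraction $\rho$ of its standard error".

$$
H_0:\ \frac{\lvert \mathrm{Bias}(\hat\psi) \rvert}{\mathrm{s.e.}(\hat\psi)} \le \rho
$$

To see why such a ratio matters, suppose (as a further simplifying assumption) that $\hat\psi$ is exactly normal with bias equal to $\rho$ times its standard error. A nominal $1-\alpha$ Wald interval then has coverage

$$
\Phi\left(z_{1-\alpha/2} - \rho\right) - \Phi\left(-z_{1-\alpha/2} - \rho\right),
$$

which equals $1-\alpha$ when $\rho = 0$ and falls as $\rho$ grows, so a small $\rho$ keeps the interval close to its nominal level.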

Tuesday, January 21, 2020 — 10:00 AM EST

**Diagnostics for Regression Models with Discrete Outcomes**

Making informed decisions about model adequacy has been an outstanding issue for regression models with discrete outcomes. Standard residuals such as Pearson and deviance residuals for such outcomes often show a large discrepancy from the hypothesized pattern even under the true model and are not informative especially when data are highly discrete. To fill this gap, we propose a surrogate empirical residual distribution function for general discrete (e.g. ordinal and count) outcomes that serves as an alternative to the empirical Cox-Snell residual distribution function. When at least one continuous covariate is available, we show asymptotically that the proposed function converges uniformly to the identity function under the correctly specified model, even with highly discrete (e.g. binary) outcomes. Through simulation studies, we demonstrate empirically that the proposed surrogate empirical residual distribution function is highly effective for various diagnostic tasks, since it is close to the hypothesized pattern under the true model and significantly departs from this pattern under model misspecification.

Monday, January 20, 2020 — 10:00 AM EST

**Sufficient Dimension Reduction for Populations with Structured Heterogeneity**

Risk modeling has become a crucial component in the effective delivery of health care. A key challenge in building effective risk models is accounting for patient heterogeneity among the diverse populations present in health systems. Incorporating heterogeneity based on the presence of various comorbidities into risk models is crucial for the development of tailored care strategies, as it can provide patient-centered information and can result in more accurate risk prediction. Yet, in the presence of high dimensional covariates, accounting for this type of heterogeneity can exacerbate estimation difficulties even with large sample sizes. Towards this aim, we propose a flexible and interpretable risk modeling approach based on semiparametric sufficient dimension reduction. The approach accounts for patient heterogeneity, borrows strength in estimation across related subpopulations to improve both estimation efficiency and interpretability, and can serve as a useful exploratory tool or as a powerful predictive model. In simulated examples, we show that our approach can improve estimation performance in the presence of heterogeneity and is quite robust to deviations from its key underlying assumption. We demonstrate the utility of our approach in the prediction of hospital admission risk for a large health system when tested on further follow-up data.

Thursday, January 16, 2020 — 10:00 AM EST

**Adapting black-box machine learning methods for causal inference**

I'll cover two recent works on the use of deep learning for causal inference with observational data. The setup for the problem is: we have an observational dataset where each observation includes a treatment, an outcome, and covariates (confounders) that may affect the treatment and outcome. We want to estimate the causal effect of the treatment on the outcome; that is, what happens if we intervene? This effect is estimated by adjusting for the covariates. The talk covers two aspects of using deep learning for this adjustment.

First, neural network research has focused on *predictive* performance, but our goal is to produce a quality *estimate* of the effect. I'll describe two adaptations to neural net design and training, based on insights from the statistical literature on the estimation of treatment effects. The first is a new architecture, the Dragonnet, that exploits the sufficiency of the propensity score for estimation adjustment. The second is a regularization procedure, targeted regularization, that induces a bias towards estimates that have non-parametrically optimal asymptotic properties.

Second, I'll describe how to use deep language models (e.g., BERT) for causal inference with text data. The challenge here is that text data is high dimensional, and naive dimension reduction may throw away information required for causal identification. The main insight is that the text representation produced by deep embedding methods suffices for the causal adjustment.
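The covariate adjustment described in the problem setup of this abstract can be illustrated in its simplest form, with a binary treatment and a single discrete confounder. The sketch below uses plain stratification rather than a neural network, so it is not the Dragonnet or targeted regularization; the function names and the toy records are made up for illustration.

```python
from collections import defaultdict


def naive_effect(records):
    """Difference in mean outcome between treated and untreated units."""
    treated = [y for t, y, _ in records if t == 1]
    control = [y for t, y, _ in records if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def adjusted_effect(records):
    """Average the within-stratum differences, weighted by stratum size."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for treatment, outcome, covariate in records:
        strata[covariate][treatment].append(outcome)
    total = sum(len(g[0]) + len(g[1]) for g in strata.values())
    effect = 0.0
    for groups in strata.values():
        if not groups[0] or not groups[1]:
            raise ValueError("every stratum needs treated and untreated units")
        size = len(groups[0]) + len(groups[1])
        diff = sum(groups[1]) / len(groups[1]) - sum(groups[0]) / len(groups[0])
        effect += (size / total) * diff
    return effect


# Toy records (treatment, outcome, confounder). Units with confounder 1 are
# treated more often and also have higher outcomes regardless of treatment.
records = [
    (1, 3.0, 0), (0, 1.0, 0), (0, 1.0, 0), (0, 1.0, 0),
    (1, 6.0, 1), (1, 6.0, 1), (1, 6.0, 1), (0, 4.0, 1),
]

naive = naive_effect(records)
adjusted = adjusted_effect(records)
```

In this toy data the treatment raises the outcome by 2 in both strata, which the adjusted estimate recovers, while the naive comparison gives 3.5 because it mixes the treatment effect with the effect of the confounder.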

Tuesday, January 14, 2020 — 10:00 AM EST

**Statistical Inference for Multi-View Clustering**

In the multi-view data setting, multiple data sets are collected on a single, common set of observations. For example, we might perform genomic and proteomic assays on a single set of tumour samples, or we might collect relationship data from two online social networks for a single set of users. It is tempting to cluster the observations using all of the data views, in order to fully exploit the available information. However, clustering the observations using all of the data views implicitly assumes that a single underlying clustering of the observations is shared across all data views. If this assumption does not hold, then clustering the observations using all data views may lead to spurious results. We seek to evaluate the assumption that there is some underlying relationship among the clusterings from the different data views, by asking the question: are the clusters within each data view dependent or independent? We develop new tests for answering this question based on multivariate and/or network data views, and apply them to multi-omics data from the Pioneer 100 Wellness Study (Price and others, 2017) and protein-protein interaction data from the HINT database (Das and Yu, 2012). We will also briefly discuss our current work on testing for no difference between the means of two estimated clusters in a single-view data set. This is joint work with Jacob Bien (University of Southern California) and Daniela Witten (University of Washington).

Monday, January 13, 2020 — 10:00 AM EST

**Sampling 'hard-to-reach' populations: recent developments**

In this talk, I will present some recent methodological developments in capture-recapture methods and Respondent-Driven Sampling (RDS).

In capture-recapture methods, our work is concerned with the analysis of marketing data on the activation of applications (apps) on mobile devices. Each application has a hashed identification number that is specific to the device on which it has been installed. This number can be registered by a platform at each activation of the application. Activations on the same device are linked together using the identification number. By focusing on activations that took place at a business location, one can create a capture-recapture data set about devices, or more specifically their users, that "visited" the business: the units are owners of mobile devices and the capture occasions are time intervals such as days. We propose a new algorithm for estimating the parameters of a robust design with a fairly large number of capture occasions, together with a simple parametric bootstrap variance estimator.

RDS is a variant of link-tracing, a sampling technique for surveying hard-to-reach communities that takes advantage of community members' social networks to reach potential participants. While the RDS sampling mechanism and associated methods of adjusting for the sampling at the analysis stage are well-documented in the statistical sciences literature, methodological focus has largely been restricted to estimation of population means and proportions (e.g. prevalence). As a network-based sampling method, RDS is faced with the fundamental problem of sampling from population networks where features such as homophily and differential activity (two measures of tendency for individuals with similar traits to share social links) are sensitive to the choice of a simulation and sampling method. In this work, *(i)* we present strategies for simulating RDS samples with known network and sample characteristics, so as to provide a foundation from which to expand the study of RDS analyses beyond the univariate framework and *(ii)* embed RDS within a causal inference framework and determine conditions under which average causal effects can be estimated. The proposed methodology will constitute a unifying approach that deals with simple estimands (means and proportions), with a natural extension to the study of associational and causal questions.

Friday, January 10, 2020 — 10:00 AM EST

**Global and local estimation of low-rank random graphs**

Random graph models have been a hot topic in statistics and machine learning, as well as in a broad range of application areas. In this talk I will give two perspectives on the estimation task of low-rank random graphs. Specifically, I will focus on estimating the latent positions in random dot product graphs. The first component of the talk focuses on the global estimation task. The minimax lower bound for global estimation of the latent positions is established, and this minimax lower bound is achieved by a Bayes procedure, referred to as the posterior spectral embedding. The second component of the talk addresses the local estimation task. We define local efficiency in estimating each individual latent position, propose a novel one-step estimator that takes advantage of the curvature information of the likelihood function (i.e., derivative information) of the graph model, and show that this estimator is locally efficient. The previously widely adopted adjacency spectral embedding is proven to be locally inefficient because it ignores the curvature information of the likelihood function. Simulation examples and the analysis of a real-world Wikipedia graph dataset are provided to demonstrate the usefulness of the proposed methods.
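For readers unfamiliar with the adjacency spectral embedding mentioned in this abstract, the sketch below shows the idea in the rank-one case: the embedding is the leading eigenvector of the matrix scaled by the square root of the leading eigenvalue. To keep the example deterministic it is applied to the noiseless edge-probability matrix rather than to a sampled adjacency matrix, and it uses plain power iteration. The function names and latent positions are made up, and this is neither the posterior spectral embedding nor the one-step estimator from the talk.

```python
import math


def top_eigenpair(matrix, iterations=200):
    """Leading eigenvalue and unit eigenvector by power iteration.

    Assumes a symmetric matrix whose leading eigenvector is not
    orthogonal to the all-ones starting vector.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
    eigenvalue = sum(v[i] * mv[i] for i in range(n))
    return eigenvalue, v


def spectral_embedding_1d(matrix):
    """One-dimensional embedding: sqrt(leading eigenvalue) * leading eigenvector."""
    eigenvalue, v = top_eigenpair(matrix)
    scale = math.sqrt(eigenvalue)
    embedding = [scale * x for x in v]
    # Eigenvectors are defined up to sign; pick the orientation with a non-negative sum.
    if sum(embedding) < 0:
        embedding = [-x for x in embedding]
    return embedding


# Hypothetical latent positions; in a random dot product graph, the
# probability of an edge between i and j is latent[i] * latent[j].
latent = [0.9, 0.7, 0.5, 0.3]
edge_probabilities = [[xi * xj for xj in latent] for xi in latent]
recovered = spectral_embedding_1d(edge_probabilities)
```

On the noiseless matrix the embedding returns the latent positions themselves. With an observed 0/1 adjacency matrix the same computation gives only a noisy estimate, which is the setting the abstract's efficiency results address.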

Thursday, January 9, 2020 — 10:00 AM EST

**Methodological Problems in Multigenerational Epidemiology**

While epidemiology has typically focused on how exposures impact the individuals directly exposed, there has been recent interest in investigating exposures with multigenerational effects, that is, ones that affect the children and grandchildren of those directly exposed. For example, a recent motivating study examined the association between maternal in-utero diethylstilbestrol exposure and ADHD in the Nurses' Health Study II. Such multigenerational studies, however, are susceptible to informative cluster size, which occurs when the number of children a mother has (the cluster size) is related to their outcomes. But what if some women have no children at all? We first consider this problem of informatively empty clusters. Second, observing populations across multiple generations can be prohibitively expensive, so multigenerational studies often measure exposures retrospectively and hence are susceptible to misclassification and recall bias. We thus study the impact of exposure misclassification when cluster size is potentially informative, as well as when misclassification is differential by cluster size. Finally, outside the relative control of laboratory settings, population-based multigenerational studies have had to entertain a broad range of study designs. We show that these designs have important implications for the scope of scientific inquiry, and we highlight areas in need of further methodological research.