Events

Methodological Problems in Multigenerational Epidemiology

While epidemiology has typically focused on how exposures affect the individuals directly exposed, there is growing interest in exposures with multigenerational effects, ones that affect the children and grandchildren of those directly exposed. For example, a recent motivating study examined the association between maternal in-utero diethylstilbestrol exposure and ADHD in the Nurses' Health Study II. Such multigenerational studies, however, are susceptible to informative cluster size, which occurs when the number of children born to a mother (the cluster size) is related to their outcomes. But what if some women have no children at all? We first consider this problem of informatively empty clusters. Second, observing populations across multiple generations can be prohibitively expensive, so multigenerational studies often measure exposures retrospectively and hence are susceptible to misclassification and recall bias. We thus study the impact of exposure misclassification when cluster size is potentially informative, as well as when misclassification is differential by cluster size. Finally, outside the relative control of laboratory settings, population-based multigenerational studies have had to entertain a broad range of study designs. We show that these designs have important implications for the scope of scientific inquiry, and we highlight areas in need of further methodological research.

Global and local estimation of low-rank random graphs

Random graph models have been a hot topic in statistics and machine learning, as well as in a broad range of application areas. In this talk I will give two perspectives on the estimation task of low-rank random graphs. Specifically, I will focus on estimating the latent positions in random dot product graphs. The first component of the talk focuses on the global estimation task. The minimax lower bound for global estimation of the latent positions is established, and this minimax lower bound is achieved by a Bayes procedure, referred to as the posterior spectral embedding. The second component of the talk addresses the local estimation task. We define local efficiency in estimating each individual latent position, propose a novel one-step estimator that takes advantage of the curvature information of the likelihood function (i.e., derivative information) of the graph model, and show that this estimator is locally efficient. The previously widely adopted adjacency spectral embedding is proven to be locally inefficient because it ignores the curvature information of the likelihood function. Simulation examples and the analysis of a real-world Wikipedia graph dataset are provided to demonstrate the usefulness of the proposed methods.
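For readers unfamiliar with the baseline mentioned in this abstract, the following is a minimal sketch of the adjacency spectral embedding for a random dot product graph. It is not the speaker's posterior spectral embedding or one-step estimator. The function names, the use of NumPy, and the example latent positions are assumptions made for illustration only.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Estimate d-dimensional latent positions from a symmetric matrix A.

    Takes the d largest eigenvalues of A and scales the matching
    eigenvectors by the square roots of those eigenvalues
    (negative eigenvalues are clipped to zero).
    """
    A = np.asarray(A, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(A)      # eigenvalues in ascending order
    top_vals = eigvals[-d:][::-1]             # d largest, in descending order
    top_vecs = eigvecs[:, -d:][:, ::-1]
    return top_vecs * np.sqrt(np.maximum(top_vals, 0.0))

def sample_rdpg(X, rng):
    """Draw an undirected, loop-free graph with P(A_ij = 1) = <x_i, x_j>."""
    P = X @ X.T
    n = P.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1)  # independent edges above the diagonal
    return (upper | upper.T).astype(float)

# Hypothetical latent positions; all pairwise inner products lie in [0, 1].
X = np.array([[0.8, 0.1], [0.7, 0.2], [0.1, 0.9], [0.2, 0.6]])
A = sample_rdpg(X, np.random.default_rng(0))
X_hat = adjacency_spectral_embedding(A, d=2)
```

Latent positions are identifiable only up to an orthogonal transformation, so an estimate is usually compared with the truth through the implied edge-probability matrix `X_hat @ X_hat.T` rather than coordinate by coordinate.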

Sampling 'hard-to-reach' populations: recent developments

In this talk, I will present some recent methodological developments in capture-recapture methods and Respondent-Driven Sampling (RDS).

In capture-recapture methods, our work is concerned with the analysis of marketing data on the activation of applications (apps) on mobile devices. Each application has a hashed identification number that is specific to the device on which it has been installed. This number can be registered by a platform at each activation of the application. Activations on the same device are linked together using the identification number. By focusing on activations that took place at a business location, one can create a capture-recapture data set about devices, or more specifically their users, that "visited" the business: the units are owners of mobile devices and the capture occasions are time intervals such as days. We propose a new algorithm for estimating the parameters of a robust design with a fairly large number of capture occasions, together with a simple parametric bootstrap variance estimator.
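The data construction described in this abstract can be illustrated with a small sketch that turns an activation log into capture histories, one 0/1 vector per device with one entry per capture occasion. The function name, the record layout (device identifier, day), and the sample identifiers are hypothetical; the estimation algorithm from the talk is not shown.

```python
from collections import defaultdict

def build_capture_histories(activations, occasions):
    """Return {device_id: [0/1 per occasion]} from (device_id, occasion) records.

    A device is "captured" on an occasion if it has at least one activation
    recorded in that time interval; repeated activations count once.
    """
    index = {occ: k for k, occ in enumerate(occasions)}
    histories = defaultdict(lambda: [0] * len(occasions))
    for device_id, occ in activations:
        if occ in index:  # skip activations outside the study window
            histories[device_id][index[occ]] = 1
    return dict(histories)

# Hypothetical log: hashed device identifiers and the day of each activation.
log = [("a1f3", "mon"), ("a1f3", "mon"), ("a1f3", "wed"), ("9c2e", "tue")]
histories = build_capture_histories(log, ["mon", "tue", "wed"])
```

Devices that were never observed at the location have no row in this table, which is why the population size has to be estimated rather than counted.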

RDS is a variant of link-tracing, a sampling technique for surveying hard-to-reach communities that takes advantage of community members' social networks to reach potential participants. While the RDS sampling mechanism and associated methods of adjusting for the sampling at the analysis stage are well-documented in the statistical sciences literature, methodological focus has largely been restricted to estimation of population means and proportions (e.g., prevalence). As a network-based sampling method, RDS is faced with the fundamental problem of sampling from population networks where features such as homophily and differential activity (two measures of tendency for individuals with similar traits to share social links) are sensitive to the choice of a simulation and sampling method. In this work, (i) we present strategies for simulating RDS samples with known network and sample characteristics, so as to provide a foundation from which to expand the study of RDS analyses beyond the univariate framework, and (ii) we embed RDS within a causal inference framework and determine conditions under which average causal effects can be estimated. The proposed methodology will constitute a unifying approach that deals with simple estimands (means and proportions), with a natural extension to the study of associational and causal questions.

Statistical Inference for Multi-View Clustering

In the multi-view data setting, multiple data sets are collected on a single, common set of observations. For example, we might perform genomic and proteomic assays on a single set of tumour samples, or we might collect relationship data from two online social networks for a single set of users. It is tempting to cluster the observations using all of the data views, in order to fully exploit the available information. However, clustering the observations using all of the data views implicitly assumes that a single underlying clustering of the observations is shared across all data views. If this assumption does not hold, then clustering the observations using all data views may lead to spurious results. We seek to evaluate the assumption that there is some underlying relationship among the clusterings from the different data views, by asking the question: are the clusters within each data view dependent or independent? We develop new tests for answering this question based on multivariate and/or network data views, and apply them to multi-omics data from the Pioneer 100 Wellness Study (Price and others, 2017) and protein-protein interaction data from the HINT database (Das and Yu, 2012). We will also briefly discuss our current work on testing for no difference between the means of two estimated clusters in a single-view data set. This is joint work with Jacob Bien (University of Southern California) and Daniela Witten (University of Washington).

Adapting black-box machine learning methods for causal inference

I'll cover two recent works on the use of deep learning for causal inference with observational data. The setup for the problem is: we have an observational dataset where each observation includes a treatment, an outcome, and covariates (confounders) that may affect the treatment and outcome. We want to estimate the causal effect of the treatment on the outcome; that is, what happens if we intervene? This effect is estimated by adjusting for the covariates. The talk covers two aspects of using deep learning for this adjustment.

First, neural network research has focused on predictive performance, but our goal is to produce a quality estimate of the effect. I'll describe two adaptations to neural net design and training, based on insights from the statistical literature on the estimation of treatment effects. The first is a new architecture, the Dragonnet, that exploits the sufficiency of the propensity score for estimation adjustment. The second is a regularization procedure, targeted regularization, that induces a bias towards estimates that have non-parametrically optimal asymptotic properties.

Second, I'll describe how to use deep language models (e.g., BERT) for causal inference with text data. The challenge here is that text data is high dimensional, and naive dimension reduction may throw away information required for causal identification. The main insight is that the text representation produced by deep embedding methods suffices for the causal adjustment.

Sufficient Dimension Reduction for Populations with Structured Heterogeneity

Risk modeling has become a crucial component in the effective delivery of health care. A key challenge in building effective risk models is accounting for patient heterogeneity among the diverse populations present in health systems. Incorporating heterogeneity based on the presence of various comorbidities into risk models is crucial for the development of tailored care strategies, as it can provide patient-centered information and can result in more accurate risk prediction. Yet, in the presence of high dimensional covariates, accounting for this type of heterogeneity can exacerbate estimation difficulties even with large sample sizes. Towards this aim, we propose a flexible and interpretable risk modeling approach based on semiparametric sufficient dimension reduction. The approach accounts for patient heterogeneity, borrows strength in estimation across related subpopulations to improve both estimation efficiency and interpretability, and can serve as a useful exploratory tool or as a powerful predictive model. In simulated examples, we show that our approach can improve estimation performance in the presence of heterogeneity and is quite robust to deviations from its key underlying assumption. We demonstrate the utility of our approach in the prediction of hospital admission risk for a large health system when tested on further follow-up data.

Diagnostics for Regression Models with Discrete Outcomes

Making informed decisions about model adequacy has been an outstanding issue for regression models with discrete outcomes. Standard residuals such as Pearson and deviance residuals for such outcomes often show a large discrepancy from the hypothesized pattern even under the true model and are not informative especially when data are highly discrete. To fill this gap, we propose a surrogate empirical residual distribution function for general discrete (e.g. ordinal and count) outcomes that serves as an alternative to the empirical Cox-Snell residual distribution function. When at least one continuous covariate is available, we show asymptotically that the proposed function converges uniformly to the identity function under the correctly specified model, even with highly discrete (e.g. binary) outcomes. Through simulation studies, we demonstrate empirically that the proposed surrogate empirical residual distribution function is highly effective for various diagnostic tasks, since it is close to the hypothesized pattern under the true model and significantly departs from this pattern under model misspecification.

The possibility of nearly assumption-free inference in causal inference

In causal effect estimation, the state of the art is the so-called double machine learning (DML) estimator, which combines the benefits of doubly robust estimation, sample splitting, and the use of machine learning methods to estimate nuisance parameters. The validity of the confidence interval associated with a DML estimator relies, for the most part, on the complexity of the nuisance parameters and on how close the machine learning estimators are to the nuisance parameters. Until we have a complete understanding of the theory of many machine learning methods, including deep neural networks, even a DML estimator may have a bias so large that it prohibits valid inference. In this talk, we describe a nearly assumption-free procedure that can either criticize the invalidity of the Wald confidence interval associated with the DML estimators of some causal effect of interest or falsify the certificates (i.e. the mathematical conditions) that, if true, could ensure valid inference. Essentially, we are testing the null hypothesis that the bias of an estimator is smaller than a fraction $\rho$ of its standard error. Our test is valid under the null without requiring any complexity (smoothness or sparsity) assumptions on the nuisance parameters or the properties of machine learning estimators, and may have power to inform analysts that they have to use something other than DML estimators or Wald confidence intervals for inference purposes. This talk is based on joint work with Rajarshi Mukherjee and James M. Robins.
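One way to write down the null hypothesis described in this abstract is sketched below. The symbols $\hat\psi$ (the DML estimator), $\psi$ (the causal effect of interest), and $\alpha$ (the nominal level) are notation assumed here, as is the use of the absolute value of the bias; only $\rho$ appears in the abstract.

```latex
% Null hypothesis: the bias is at most a fraction \rho of the standard error
H_0 : \ \bigl| \mathbb{E}[\hat\psi] - \psi \bigr| < \rho \cdot \mathrm{s.e.}(\hat\psi)

% Wald confidence interval whose validity is in question
\hat\psi \pm z_{1-\alpha/2} \, \mathrm{s.e.}(\hat\psi)
```

When the bias is small relative to the standard error, the Wald interval is close to its nominal coverage, which is why rejecting $H_0$ casts doubt on the interval.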

Clustering and Classification of Three-Way Data

Clustering and classification is the process of finding and analyzing underlying group structure in heterogeneous data and is fundamental to computational statistics and machine learning. In the past, relatively simple techniques could be used for clustering; however, with data becoming increasingly complex, these methods are oftentimes not advisable, and in some cases not possible. One such example is the analysis of three-way data, where each data point is represented as a matrix instead of a traditional vector. Examples of three-way data include greyscale images and multivariate longitudinal data. In this talk, recent methods for clustering three-way data will be presented, including methods for high-dimensional and skewed three-way data. Both simulated and real data will be used for illustration, and future directions and extensions will be discussed.

Network Analysis of the Brain: from Generative Modeling to Multilayer Network Embedding of Functional Connectivity Data

Recent large-scale projects in neuroscience, such as the Human Connectome Project and the BRAIN initiative, emphasize the need for new statistical and computational techniques for analyzing functional connectivity within and across populations. Network-based models have greatly improved our understanding of brain structure and function, yet many important challenges remain. In this talk, I will consider two particularly important challenges: i) how does one characterize the generative mechanisms of functional connectivity, and ii) how does one identify discriminatory features among connectivity scans over disparate populations? To address the first challenge, I propose and describe a generative network model, called the correlation generalized exponential random graph model (cGERGM), that flexibly characterizes the joint network topology of correlation networks arising in functional connectivity. The model is the first of its kind to directly assess the network structure of a correlation network while simultaneously handling the mathematical constraints of a correlation matrix. I apply the cGERGM to resting state fMRI data from healthy individuals in the Human Connectome Project. The cGERGM reveals remarkably consistent organizational properties guiding subnetwork architecture, suggesting a fundamental organizational basis for subnetwork communication that differs from previous beliefs.

For the second challenge, I focus on learning interpretable features from complex multilayer networks arising in population studies of functional connectivity. I will introduce the multi-node2vec algorithm, an efficient and scalable feature engineering method that learns continuous node feature representations from multilayer networks. The multi-node2vec algorithm identifies maximum likelihood estimators of nodal features through the use of the Skip-gram neural network model. Asymptotic analysis of the algorithm reveals that it is a fast approximation to a multi-dimensional non-negative matrix factorization applied to a weighted average of the layers in the multilayer network. I apply multi-node2vec to a multilayer functional brain network from resting state fMRI scans over a population of 74 healthy individuals and 70 patients with varying degrees of schizophrenia. The identified functional embeddings closely associate with the functional organization of the brain and offer important insights into the differences between patient and healthy groups that are well supported by theory.

Department seminar by Glen McGee, Harvard T.H. Chan School of Public Health

Methodological Problems in Multigenerational Epidemiology

Department seminar by Fangzheng Xie, Johns Hopkins University

Global and local estimation of low-rank random graphs

Department seminar by Mamadou Yauck, McGill University

Department seminar by Lucy Gao, University of Washington

Department seminar by Victor Veitch, Columbia University

Department seminar by Jared Huling, Ohio State University

Department seminar by Lu Yang, University of Amsterdam

Department seminar by Lin Liu, Harvard University

Department seminar by Michael Gallaugher, McMaster University

Department seminar by James Wilson, University of San Francisco

Network Analysis of the Brain: from Generative Modeling to Multilayer Network Embedding of Functional Connectivity Data