Thursday, June 25, 2020 — 4:00 PM EDT

**Fairness through Experimentation: Inequality in A/B testing as an approach to responsible design**

As technology continues to advance, there is increasing concern about individuals being left behind. Many businesses are striving to adopt responsible design practices and avoid any unintended consequences of their products and services, ranging from privacy vulnerabilities to algorithmic bias. We propose a novel approach to fairness and inclusiveness based on experimentation. We use experimentation because we want to assess not only the intrinsic properties of products and algorithms but also their impact on people. We do this by introducing an inequality approach to A/B testing, leveraging the Atkinson index from the economics literature. We show how to perform causal inference over this inequality measure. We also introduce the concept of site-wide inequality impact, which captures the inclusiveness impact of targeting specific subpopulations for experiments, and show how to conduct statistical inference on this impact. We provide real examples from LinkedIn, as well as an open-source, highly scalable implementation of the computation of the Atkinson index and its variance in Spark/Scala. We also provide over a year's worth of learnings -- gathered by deploying our method at scale and analyzing thousands of experiments -- on which areas and which kinds of product innovations seem to inherently foster fairness through inclusiveness.
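The Atkinson index at the heart of this approach is simple to compute. As a rough, hypothetical sketch in NumPy (not the talk's Spark/Scala implementation; the function name and defaults here are illustrative), for an inequality-aversion parameter epsilon:

```python
import numpy as np

def atkinson_index(y, epsilon=1.0):
    """Atkinson inequality index for positive outcomes y.

    Returns a value in [0, 1); 0 means perfect equality.
    """
    y = np.asarray(y, dtype=float)
    mean = y.mean()
    if epsilon == 1.0:
        # Limit case: one minus the ratio of geometric to arithmetic mean.
        return 1.0 - np.exp(np.mean(np.log(y))) / mean
    # "Equally distributed equivalent" outcome, relative to the mean.
    ede = np.mean(y ** (1.0 - epsilon)) ** (1.0 / (1.0 - epsilon))
    return 1.0 - ede / mean

print(round(atkinson_index([1, 1, 1, 1], epsilon=1.0), 6))  # equal outcomes -> 0.0
```

The variance of this index, needed for the causal inference the talk describes, is what the open-source implementation computes at scale.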

The details to connect to this seminar will be available in the near future.

Thursday, June 4, 2020 — 4:00 PM EDT

**Can the reported COVID-19 data tell us the truth? Scrutinizing the data from the measurement error models perspective**

The mystery of the coronavirus disease 2019 (COVID-19) and the lack of effective treatments have had a strikingly negative impact on public health. While research on COVID-19 has been ramping up rapidly, a very important yet overlooked challenge concerns the quality and unique features of COVID-19 data. The manifestations of COVID-19 are not yet well understood. The swift spread of the virus is largely attributed to its stealthy transmissions, in which infected patients may be asymptomatic or exhibit only flu-like symptoms in the early stage. Due to limited test resources and a good portion of asymptomatic infections, the confirmed cases are typically under-reported, error-contaminated, and subject to substantial noise. If the drastic effects of faulty data are not addressed, analysis results of COVID-19 data can be seriously biased.

In this talk, I will discuss the issues induced from faulty COVID-19 data and how they may challenge inferential procedures. I will describe a strategy of employing measurement error models to address the error effects. Sensitivity analyses will be conducted to quantify the impact of faulty data for different scenarios. In addition, I will present a website of COVID-19 Canada (https://covid-19-canada.uwo.ca/), developed by the team co-led by Dr. Wenqing He and myself, which provides comprehensive and real-time visualization of the Canadian COVID-19 data.

**Please note:** This seminar will be given online through Webex. To join, please follow this link: **Virtual seminar by Grace Yi**.

Thursday, May 7, 2020 — 4:00 PM EDT

**Please note: This seminar will be given online.**

**Two-sample test on funscalar data with application to hemodialysis monitoring by Raman spectroscopy**

To achieve in-session monitoring of hemodialysis through Raman spectroscopy, it is necessary to compare data consisting of Raman spectra and intensity values for specific biomarkers (e.g., urea) contained in the waste dialysate used in hemodialysis treatment. This calls for the development of a two-sample test procedure for funscalar data: data that are a mix of functional and scalar variables. Despite a rich literature on univariate functional data testing procedures and a few publications on multivariate functional data testing procedures, no such testing procedure exists for funscalar data. In this work we propose the first testing procedure for funscalar data, generalizing the functional data approach of Horvath et al. (2013). The test statistic is based on the L_2 distance between the two mean funscalar objects. Its asymptotic null distribution and asymptotic power are studied. We then demonstrate its performance through extensive simulations and its usefulness through data collected in our hemodialysis monitoring experiments.
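The abstract does not spell out the statistic's exact form, but an L_2-distance statistic of this flavor can be sketched as follows (a hypothetical NumPy illustration; the function `funscalar_l2_stat` and its scaling are assumptions, not the paper's definition):

```python
import numpy as np

def funscalar_l2_stat(f1, s1, f2, s2, grid):
    """Hypothetical L2-type two-sample statistic for funscalar data.

    f1, f2: (n_i, len(grid)) arrays of curves (e.g., Raman spectra);
    s1, s2: (n_i, p) arrays of scalar variables (e.g., urea levels).
    """
    n, m = f1.shape[0], f2.shape[0]
    curve_diff = f1.mean(axis=0) - f2.mean(axis=0)
    # Squared L2 distance between the mean curves, via the trapezoidal rule.
    curve_part = np.trapz(curve_diff ** 2, grid)
    # Squared Euclidean distance between the scalar means.
    scalar_part = np.sum((s1.mean(axis=0) - s2.mean(axis=0)) ** 2)
    return n * m / (n + m) * (curve_part + scalar_part)
```

Large values of such a statistic indicate that the two mean funscalar objects differ; calibrating it requires the asymptotic null distribution the work derives.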

This seminar will be hosted by Webex.

To join, please follow this link: Department Seminar by Pang Du

Thursday, March 12, 2020 — 4:00 PM EDT

**Bayesian Additive Regression Trees for Statistical Learning**

Regression trees are flexible nonparametric models that are well suited to many modern statistical learning problems. Many such tree models have been proposed, from the simple single-tree model (e.g. Classification and Regression Trees — CART) to more complex tree ensembles (e.g. Random Forests). Their nonparametric formulation allows one to model datasets exhibiting complex non-linear relationships between predictors and the response. A recent innovation in the statistical literature is the development of a Bayesian analogue to these classical regression tree models. The benefit of the Bayesian approach is the ability to quantify uncertainties within a holistic Bayesian framework. We introduce the most popular variant, the Bayesian Additive Regression Trees (BART) model, and describe recent innovations to this framework. We conclude with some of the exciting research directions currently being explored.

Thursday, March 12, 2020 — 2:45 PM EDT

**Please note: This seminar has been cancelled.**

Friday, March 6, 2020 — 10:30 AM EST

**Please note: This seminar has been cancelled.**

Thursday, March 5, 2020 — 4:00 PM EST

**Concentration of Maxima: Fundamental Limits of Exact Support Recovery in High Dimensions**

We study the estimation of the support (the set of non-zero components) of a sparse high-dimensional signal observed with additive and dependent noise. Under the usual parameterization of the size of the support set and the signal magnitude, we characterize a phase-transition phenomenon akin to Ingster's signal detection boundary. We show that when the signal is above the so-called strong classification boundary, thresholding estimators achieve asymptotically perfect support recovery. This is so under arbitrary error dependence assumptions, provided that the marginal error distribution has rapidly varying tails. Conversely, under mild dependence conditions on the noise, we show that no thresholding estimator can achieve perfect support recovery if the signal is below the boundary. For log-concave error densities, the thresholding estimators are shown to be optimal, and hence the strong classification boundary is universal in this setting.
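As a concrete toy illustration (my own sketch, not the authors' code), a thresholding estimator of the support simply keeps the coordinates whose observed magnitude exceeds a threshold calibrated to the noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

p, k, signal = 10_000, 20, 10.0          # dimension, support size, signal magnitude
support = np.sort(rng.choice(p, size=k, replace=False))
mu = np.zeros(p)
mu[support] = signal                      # sparse signal, well above the noise level
x = mu + rng.standard_normal(p)           # additive Gaussian noise

# Thresholding estimator: keep coordinates whose magnitude exceeds the threshold.
# Here 5.0 sits a bit above sqrt(2 * log p) ~ 4.3, roughly the maximum of
# p independent N(0, 1) noise terms.
threshold = 5.0
support_hat = np.flatnonzero(np.abs(x) > threshold)

print(len(support_hat), np.array_equal(support_hat, support))
```

With the signal this far above the boundary, recovery is essentially exact; the phase transition the talk describes concerns what happens as the signal magnitude shrinks toward the noise maximum.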

The proofs exploit a concentration of maxima phenomenon known as relative stability. We obtain a complete characterization of the relative stability phenomenon for dependent Gaussian noise via Slepian and Sudakov-Fernique bounds and some Ramsey theory.

Friday, February 7, 2020 — 10:00 AM EST

**The Extended Reproducibility Phenotype - Re-framing and Generalizing Computational Reproducibility**

Computational reproducibility has become a crucial part of how data-analytic results are understood and assessed both in and outside of academia. Less work, however, has explored whether these strict computational reproducibility criteria are necessary or sufficient to actually meet our needs as consumers of analysis results. I will show that in principle they are neither. I will present two inter-related veins of work. First, I will provide a conceptual reframing of the concept of strict reproducibility, and the actions analysts take to ensure it, in terms of our ability to actually trust the results and the claims about the underlying data-generating systems they embody. Second, I will present a generalized conception of reproducibility by introducing the concepts of Currency, Comparability and Completeness and their oft-overlooked importance to assessing data analysis results.

Thursday, February 6, 2020 — 10:00 AM EST

**Censoring Unbiased Regression Trees and Ensembles**

Tree-based methods are useful tools to identify risk groups and conduct prediction by employing recursive partitioning to separate subjects into different risk groups. We propose a novel paradigm of building regression trees for censored data in survival analysis. We prudently construct the censored-data loss function through an extension of the theory of censoring unbiased transformations. With the construction, we can conveniently implement the proposed regression trees algorithm using existing software for the Classification and Regression Trees algorithm (e.g., rpart package in R) and extend it for ensemble learning. Simulations and real data examples demonstrate that our methods either improve upon or remain competitive with existing tree-based algorithms for censored data.

Wednesday, February 5, 2020 — 10:00 AM EST

**Detecting the Signal Among Noise and Contamination in High Dimensions**

Improvements in biomedical technology and a surge in other data-driven sciences have led to the collection of increasingly large amounts of data. In this affluence of data, contamination is ubiquitous but often neglected, creating substantial risk of spurious scientific discoveries. Especially in applications with high-dimensional data, for instance proteomic biomarker discovery, the impact of contamination on methods for variable selection and estimation can be profound yet difficult to diagnose.

In this talk I present a method for variable selection and estimation in high-dimensional linear regression models, leveraging the elastic-net penalty for complex data structures. The method is capable of harnessing the collected information even in the presence of arbitrary contamination in the response and the predictors. I showcase the method’s theoretical and practical advantages, specifically in applications with heavy-tailed errors and limited control over the data. I outline efficient algorithms to tackle computational challenges posed by inherently non-convex objective functions of robust estimators and practical strategies for hyper-parameter selection, ensuring scalability of the method and applicability to a wide range of problems.

Tuesday, February 4, 2020 — 10:00 AM EST

**Bayesian Utility-Based Toxicity Probability Interval Design for Dose Finding in Phase I/II Trials**

Molecularly targeted agents and immunotherapy have revolutionized modern cancer treatment. Unlike chemotherapy, the maximum tolerated dose of a targeted therapy may not offer significant clinical benefit over lower doses. By simultaneously considering both binary toxicity and efficacy endpoints, phase I/II trials can identify a better dose for subsequent phase II trials, in terms of the efficacy-toxicity tradeoff, than traditional phase I trials. Existing phase I/II dose-finding methods are model-based or require pre-specifying many design parameters, which makes them difficult to implement in practice. To strengthen and simplify the current practice of phase I/II trials, we propose a utility-based toxicity probability interval (uTPI) design for finding the optimal biological dose (OBD) where binary toxicity and efficacy endpoints are observed. The uTPI design is model-assisted in nature, simply modeling the utility outcomes observed at the current dose level based on a quasibinomial likelihood. Toxicity probability intervals are used to screen out overly toxic dose levels, and dose escalation/de-escalation decisions are then made adaptively by comparing the posterior utility distributions of the levels adjacent to the current dose. The uTPI design is flexible in accommodating various utility functions while requiring only a minimal number of design parameters. A prominent feature of the uTPI design is its simple decision structure: a concise dose-assignment decision table can be calculated before the start of the trial and used throughout, which greatly simplifies practical implementation of the design. Extensive simulation studies demonstrate that the proposed uTPI design yields desirable as well as robust performance under various scenarios. This talk is based on joint work with Ruitao Lin and Ying Yuan at MD Anderson Cancer Center.

Thursday, January 30, 2020 — 10:00 AM EST

**Batch-mode active learning for regression and its application to the valuation of large variable annuity portfolios**

Supervised learning algorithms require a sufficient amount of labeled data to construct an accurate predictive model. In practice, collecting labeled data may be extremely time-consuming, while unlabeled data can be accessed easily. In a situation where labeled data are insufficient for a prediction model to perform well and the budget for additional data collection is limited, it is important to select the objects to be labeled based on how much they are expected to improve the model's performance. In this talk, I will focus on active learning, which aims to train an accurate prediction model at minimum labeling cost. In particular, I will present batch-mode active learning for regression problems. Based on random forests, I will propose two effective random sampling algorithms that use the prediction ambiguities and diversities of unlabeled objects as measures of their informativeness. Empirical results on an insurance data set demonstrate the effectiveness of the proposed approaches in valuing large variable annuity portfolios (a practical problem in the actuarial field). Additionally, comparisons with an existing framework that relies on a sequential combination of unsupervised and supervised learning algorithms are also investigated.
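The "prediction ambiguity" idea can be illustrated with a hypothetical sketch (the function name and the hand-rolled bootstrap ensemble below are my illustration, not the proposed algorithms): ambiguity is measured as the spread of an ensemble's predictions at each unlabeled point, and the batch with the highest ambiguity is queried for labels.

```python
import numpy as np

def ambiguity_batch(ensemble_preds, batch_size):
    """Pick the unlabeled points with the highest prediction ambiguity.

    ensemble_preds: (n_models, n_pool) predictions from an ensemble
    (e.g., the trees of a random forest). Ambiguity is the standard
    deviation of the ensemble's predictions at each pool point.
    """
    ambiguity = ensemble_preds.std(axis=0)
    return np.argsort(ambiguity)[-batch_size:]

# Toy example: a bootstrap ensemble of straight-line fits stands in for a forest.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(40, 1))
y = X[:, 0] ** 2 + 0.05 * rng.standard_normal(40)    # labeled data
X_pool = np.linspace(-3, 3, 200)[:, None]             # unlabeled pool

preds = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))             # bootstrap resample
    A = np.c_[np.ones(len(idx)), X[idx]]
    coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
    preds.append(np.c_[np.ones(len(X_pool)), X_pool] @ coef)

batch = ambiguity_batch(np.asarray(preds), batch_size=5)
print(X_pool[batch].ravel())   # the most ambiguous points lie far from the labeled data
```

The talk's second ingredient, diversity, would additionally spread the selected batch out so its points are not all near each other.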

Wednesday, January 29, 2020 — 10:00 AM EST

#### Marginal analysis of multiple outcomes with informative cluster size

Periodontal disease is a serious infection of the gums and the bones surrounding the teeth. In the Veterans Affairs Dental Longitudinal Study (VADLS), the relationships between periodontal disease and other health and socioeconomic conditions are of interest. To determine whether or not a patient has periodontal disease, multiple clinical measurements (clinical attachment loss, alveolar bone loss, tooth mobility) are taken at the tooth level. However, no universal definition of periodontal disease exists, and researchers often create a composite outcome from these measurements or analyze each outcome separately. Moreover, patients have varying numbers of teeth, with those more prone to the disease having fewer teeth than those with good oral health. Such dependence between the outcome of interest and cluster size (the number of teeth) is called informative cluster size, and results obtained from fitting conventional marginal models can be biased. In this talk, I will introduce a novel method to jointly analyze multiple correlated outcomes for clustered data with informative cluster size using the class of generalized estimating equations (GEE) with cluster-specific weights. Using the data from VADLS, I will compare the results obtained from the proposed multivariate-outcome cluster-weighted GEE to those from the conventional unweighted GEE. Finally, I will discuss a few other research settings where data may exhibit informative cluster size.
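The bias mechanism is easy to simulate. In this hypothetical sketch, sicker patients have both fewer teeth and worse outcomes, so the naive tooth-level mean is pulled toward the healthy (large-cluster) patients, while inverse-cluster-size weighting, the idea behind cluster-weighted estimating equations, recovers the patient-level mean:

```python
import numpy as np

rng = np.random.default_rng(3)

# Informative cluster size: healthier patients keep more teeth AND have
# lower (better) outcome values.
clusters = []
for _ in range(500):
    health = rng.uniform(0, 1)                        # latent patient health
    n_teeth = rng.integers(5, 5 + int(25 * health) + 1)
    outcome = (1 - health) + 0.1 * rng.standard_normal(n_teeth)
    clusters.append(outcome)

all_obs = np.concatenate(clusters)
unweighted = all_obs.mean()                           # naive tooth-level mean
weighted = np.mean([c.mean() for c in clusters])      # weight 1/n_i per tooth

print(unweighted, weighted)  # the unweighted mean is pulled toward healthy patients
```

Here the patient-level mean outcome is 0.5 by construction; the weighted estimate is near it, while the unweighted one is biased downward because healthy, low-outcome patients contribute more teeth.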

Monday, January 27, 2020 — 10:00 AM EST

#### Network Analysis of the Brain: from Generative Modeling to Multilayer Network Embedding of Functional Connectivity Data

Recent large-scale projects in neuroscience, such as the Human Connectome Project and the BRAIN initiative, emphasize the need for new statistical and computational techniques for analyzing functional connectivity within and across populations. Network-based models have greatly improved our understanding of brain structure and function, yet many important challenges remain. In this talk, I will consider two particularly important challenges: i) how does one characterize the generative mechanisms of functional connectivity, and ii) how does one identify discriminatory features among connectivity scans over disparate populations? To address the first challenge, I propose and describe a generative network model, called the correlation generalized exponential random graph model (cGERGM), that flexibly characterizes the joint network topology of correlation networks arising in functional connectivity. The model is the first of its kind to directly assess the network structure of a correlation network while simultaneously handling the mathematical constraints of a correlation matrix. I apply the cGERGM to resting state fMRI data from healthy individuals in the Human Connectome Project. The cGERGM reveals remarkably consistent organizational properties guiding subnetwork architecture, suggesting a fundamental organizational basis for subnetwork communication that differs from previous beliefs.

For the second challenge, I focus on learning interpretable features from complex multilayer networks arising in population studies of functional connectivity. I will introduce the multi-node2vec algorithm, an efficient and scalable feature engineering method that learns continuous node feature representations from multilayer networks. The multi-node2vec algorithm identifies maximum likelihood estimators of nodal features through the use of the Skip-gram neural network model. Asymptotic analysis of the algorithm reveals that it is a fast approximation to a multi-dimensional non-negative matrix factorization applied to a weighted average of the layers in the multilayer network. I apply multi-node2vec to a multilayer functional brain network from resting state fMRI scans over a population of 74 healthy individuals and 70 patients with varying degrees of schizophrenia. The identified functional embeddings closely associate with the functional organization of the brain and offer important insights into the differences between patient and healthy groups that are well supported by theory.

Friday, January 24, 2020 — 10:00 AM EST

**Clustering and Classification of Three-Way Data**

Clustering and classification is the process of finding and analyzing underlying group structure in heterogeneous data and is fundamental to computational statistics and machine learning. In the past, relatively simple techniques could be used for clustering; however, with data becoming increasingly complex, these methods are oftentimes not advisable, and in some cases not possible. One such example is the analysis of three-way data, where each data point is represented as a matrix instead of a traditional vector. Examples of three-way data include greyscale images and multivariate longitudinal data. In this talk, recent methods for clustering three-way data will be presented, including methods for high-dimensional and skewed three-way data. Both simulated and real data will be used for illustration, and future directions and extensions will be discussed.

Wednesday, January 22, 2020 — 10:00 AM EST

**The possibility of nearly assumption-free inference in causal inference**

In causal effect estimation, the state of the art is the so-called double machine learning (DML) estimators, which combine the benefits of doubly robust estimation, sample splitting, and the use of machine learning methods to estimate nuisance parameters. The validity of the confidence interval associated with a DML estimator relies, for the most part, on the complexity of the nuisance parameters and how close the machine learning estimators are to them. Until we have a complete understanding of the theory of many machine learning methods, including deep neural networks, even a DML estimator may have a bias so large that it prohibits valid inference. In this talk, we describe a nearly assumption-free procedure that can either criticize the invalidity of the Wald confidence interval associated with the DML estimators of some causal effect of interest or falsify the certificates (i.e., the mathematical conditions) that, if true, could ensure valid inference. Essentially, we test the null hypothesis that the bias of an estimator is smaller than a fraction $\rho$ of its standard error. Our test is valid under the null without requiring any complexity (smoothness or sparsity) assumptions on the nuisance parameters or on the properties of the machine learning estimators, and may have the power to inform analysts that they must do something other than rely on DML estimators or Wald confidence intervals for inference purposes. This talk is based on joint work with Rajarshi Mukherjee and James M. Robins.

Tuesday, January 21, 2020 — 10:00 AM EST

**Diagnostics for Regression Models with Discrete Outcomes**

Making informed decisions about model adequacy has been an outstanding issue for regression models with discrete outcomes. Standard residuals such as Pearson and deviance residuals for such outcomes often show a large discrepancy from the hypothesized pattern even under the true model and are not informative especially when data are highly discrete. To fill this gap, we propose a surrogate empirical residual distribution function for general discrete (e.g. ordinal and count) outcomes that serves as an alternative to the empirical Cox-Snell residual distribution function. When at least one continuous covariate is available, we show asymptotically that the proposed function converges uniformly to the identity function under the correctly specified model, even with highly discrete (e.g. binary) outcomes. Through simulation studies, we demonstrate empirically that the proposed surrogate empirical residual distribution function is highly effective for various diagnostic tasks, since it is close to the hypothesized pattern under the true model and significantly departs from this pattern under model misspecification.

Monday, January 20, 2020 — 10:00 AM EST

**Sufficient Dimension Reduction for Populations with Structured Heterogeneity**

Risk modeling has become a crucial component in the effective delivery of health care. A key challenge in building effective risk models is accounting for patient heterogeneity among the diverse populations present in health systems. Incorporating heterogeneity based on the presence of various comorbidities into risk models is crucial for the development of tailored care strategies, as it can provide patient-centered information and can result in more accurate risk prediction. Yet, in the presence of high dimensional covariates, accounting for this type of heterogeneity can exacerbate estimation difficulties even with large sample sizes. Towards this aim, we propose a flexible and interpretable risk modeling approach based on semiparametric sufficient dimension reduction. The approach accounts for patient heterogeneity, borrows strength in estimation across related subpopulations to improve both estimation efficiency and interpretability, and can serve as a useful exploratory tool or as a powerful predictive model. In simulated examples, we show that our approach can improve estimation performance in the presence of heterogeneity and is quite robust to deviations from its key underlying assumption. We demonstrate the utility of our approach in the prediction of hospital admission risk for a large health system when tested on further follow-up data.

Thursday, January 16, 2020 — 10:00 AM EST

**Adapting black-box machine learning methods for causal inference**

I'll cover two recent works on the use of deep learning for causal inference with observational data. The setup for the problem is: we have an observational dataset where each observation includes a treatment, an outcome, and covariates (confounders) that may affect both the treatment and the outcome. We want to estimate the causal effect of the treatment on the outcome; that is, what happens if we intervene? This effect is estimated by adjusting for the covariates. The talk covers two aspects of using deep learning for this adjustment.

First, neural network research has focused on *predictive* performance, but our goal is to produce a quality *estimate* of the effect. I'll describe two adaptations to neural net design and training, based on insights from the statistical literature on the estimation of treatment effects. The first is a new architecture, the Dragonnet, that exploits the sufficiency of the propensity score for estimation adjustment. The second is a regularization procedure, targeted regularization, that induces a bias towards estimates that have non-parametrically optimal asymptotic properties.

Second, I'll describe how to use deep language models (e.g., BERT) for causal inference with text data. The challenge here is that text data is high dimensional, and naive dimension reduction may throw away information required for causal identification. The main insight is that the text representation produced by deep embedding methods suffices for the causal adjustment.

Tuesday, January 14, 2020 — 10:00 AM EST

**Statistical Inference for Multi-View Clustering**

In the multi-view data setting, multiple data sets are collected on a single, common set of observations. For example, we might perform genomic and proteomic assays on a single set of tumour samples, or we might collect relationship data from two online social networks for a single set of users. It is tempting to cluster the observations using all of the data views, in order to fully exploit the available information. However, clustering the observations using all of the data views implicitly assumes that a single underlying clustering of the observations is shared across all data views. If this assumption does not hold, then clustering the observations using all data views may lead to spurious results. We seek to evaluate the assumption that there is some underlying relationship among the clusterings from the different data views, by asking the question: are the clusters within each data view dependent or independent? We develop new tests for answering this question based on multivariate and/or network data views, and apply them to multi-omics data from the Pioneer 100 Wellness Study (Price and others, 2017) and protein-protein interaction data from the HINT database (Das and Yu, 2012). We will also briefly discuss our current work on testing for no difference between the means of two estimated clusters in a single-view data set. This is joint work with Jacob Bien (University of Southern California) and Daniela Witten (University of Washington).

Monday, January 13, 2020 — 10:00 AM EST

**Sampling 'hard-to-reach' populations: recent developments**

In this talk, I will present some recent methodological developments in capture-recapture methods and Respondent-Driven Sampling (RDS).

In capture-recapture methods, our work is concerned with the analysis of marketing data on the activation of applications (apps) on mobile devices. Each application has a hashed identification number that is specific to the device on which it has been installed. This number can be registered by a platform at each activation of the application. Activations on the same device are linked together using the identification number. By focusing on activations that took place at a business location, one can create a capture-recapture data set about devices, or more specifically their users, that "visited" the business: the units are owners of mobile devices and the capture occasions are time intervals such as days. We proposed a new algorithm for estimating the parameters of a robust design with a fairly large number of capture occasions, together with a simple parametric bootstrap variance estimator.

RDS is a variant of link-tracing, a sampling technique for surveying hard-to-reach communities that takes advantage of community members' social networks to reach potential participants. While the RDS sampling mechanism and associated methods of adjusting for the sampling at the analysis stage are well-documented in the statistical sciences literature, methodological focus has largely been restricted to estimation of population means and proportions (e.g., prevalence). As a network-based sampling method, RDS is faced with the fundamental problem of sampling from population networks where features such as homophily and differential activity (two measures of tendency for individuals with similar traits to share social links) are sensitive to the choice of a simulation and sampling method. In this work, *(i)* we present strategies for simulating RDS samples with known network and sample characteristics, so as to provide a foundation from which to expand the study of RDS analyses beyond the univariate framework and *(ii)* embed RDS within a causal inference framework and determine conditions under which average causal effects can be estimated. The proposed methodology will constitute a unifying approach that deals with simple estimands (means and proportions), with a natural extension to the study of associational and causal questions.

Friday, January 10, 2020 — 10:00 AM EST

#### Global and local estimation of low-rank random graphs

Random graph models have been a topic of intense interest in statistics and machine learning, as well as in a broad range of application areas. In this talk I will give two perspectives on the estimation task for low-rank random graphs. Specifically, I will focus on estimating the latent positions in random dot product graphs. The first component of the talk focuses on the global estimation task. The minimax lower bound for global estimation of the latent positions is established, and this lower bound is achieved by a Bayes procedure, referred to as the posterior spectral embedding. The second component of the talk addresses the local estimation task. We define local efficiency in estimating each individual latent position, propose a novel one-step estimator that takes advantage of the curvature information (i.e., derivative information) of the likelihood function of the graph model, and show that this estimator is locally efficient. The previously widely adopted adjacency spectral embedding is proven to be locally inefficient because it ignores this curvature information. Simulation examples and the analysis of a real-world Wikipedia graph dataset are provided to demonstrate the usefulness of the proposed methods.

Thursday, January 9, 2020 — 10:00 AM EST

#### Methodological Problems in Multigenerational Epidemiology

While epidemiology has typically focused on how exposures affect the individuals directly exposed, there has been recent interest in investigating exposures with multigenerational effects: ones that affect the children and grandchildren of those directly exposed. For example, a recent motivating study examined the association between maternal in-utero diethylstilbestrol exposure and ADHD in the Nurses' Health Study II. Such multigenerational studies, however, are susceptible to informative cluster size, which occurs when the number of children of a mother (the cluster size) is related to their outcomes. But what if some women have no children at all? We first consider this problem of informatively empty clusters. Second, observing populations across multiple generations can be prohibitively expensive, so multigenerational studies often measure exposures retrospectively, and hence are susceptible to misclassification and recall bias. We thus study the impact of exposure misclassification when cluster size is potentially informative, as well as when misclassification is differential by cluster size. Finally, outside the relative control of laboratory settings, population-based multigenerational studies have had to entertain a broad range of study designs. We show that these designs have important implications for the scope of scientific inquiry, and we highlight areas in need of further methodological research.

Tuesday, January 7, 2020 — 10:00 AM EST

#### Renewable Estimation and Incremental Inference in Streaming Data Analysis

New data collection and storage technologies have given rise to the new field of streaming data analytics, including real-time statistical methodology for online data analyses. Streaming data refers to high-throughput recordings with large volumes of observations gathered sequentially and perpetually over time. Examples include national disease registries, mobile health data, and disease surveillance, among others. This talk primarily concerns the development of a fast, real-time statistical estimation and inference method for regression analysis, with the particular objective of addressing challenges in streaming data storage and computational efficiency. Termed renewable estimation, this method enjoys strong theoretical guarantees, including asymptotic unbiasedness and estimation efficiency, as well as fast computational speed. The key technical novelty is that the proposed method uses only the current data and summary statistics of historical data. The proposed algorithm will be demonstrated in generalized linear models (GLMs) for cross-sectional data. I will discuss both the conceptual underpinnings and the theoretical guarantees of the method and illustrate its performance via numerical examples. This is joint work with my supervisor, Professor Peter Song.
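The flavor of the approach is easiest to see in a simplified, hypothetical sketch for the linear model (the talk's renewable estimators cover general GLMs with proper incremental inference): the running summaries X'X and X'y are all that need be stored, never the raw historical batches.

```python
import numpy as np

class RenewableLS:
    """Streaming least squares: keeps only X'X and X'y, never the raw history."""

    def __init__(self, p):
        self.xtx = np.zeros((p, p))
        self.xty = np.zeros(p)

    def update(self, X, y):
        # Renew the summary statistics using the current data batch only.
        self.xtx += X.T @ X
        self.xty += X.T @ y

    def estimate(self):
        return np.linalg.solve(self.xtx, self.xty)

rng = np.random.default_rng(1)
beta = np.array([2.0, -1.0, 0.5])
model = RenewableLS(p=3)
for _ in range(50):                       # 50 sequential batches arrive over time
    X = rng.standard_normal((100, 3))
    model.update(X, X @ beta)             # noiseless responses, for a clean check
print(model.estimate())                   # recovers beta (up to floating-point error)
```

The resulting estimate is identical to fitting all 5,000 observations at once, which is the sense in which nothing is lost by discarding the raw historical data.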

Monday, January 6, 2020 — 10:00 AM EST

#### Navigation and Evaluation of Latent Structure in High-Dimensional Data

In the modern data analysis paradigm, fitting models is easy, but knowing how to design or evaluate them is difficult. In this talk, we will adapt insights from graphical statistics and goodness-of-fit testing to modern problems, illustrating them with applications to microbiome genomics and climate systems science.

For the microbiome, we show how linking complementary displays can make it easy to query structure in raw data. We also find novel visual summaries that inform model criticism more deeply than data splitting strategies alone. We then describe how artificial intelligence can be used to accelerate climate simulations, and introduce techniques for characterizing goodness-of-fit of the resulting models.

Viewed broadly, these projects provide opportunities for human interaction in the automated data processing regime, facilitating (1) streamlined navigation of data and (2) critical evaluation of models.