Monday, January 20, 2020 — 10:00 AM EST

Sufficient Dimension Reduction for Populations with Structured Heterogeneity

Risk modeling has become a crucial component in the effective delivery of health care. A key challenge in building effective risk models is accounting for patient heterogeneity among the diverse populations present in health systems. Incorporating heterogeneity based on the presence of various comorbidities into risk models is crucial for the development of tailored care strategies, as it can provide patient-centered information and can result in more accurate risk prediction. Yet, in the presence of high dimensional covariates, accounting for this type of heterogeneity can exacerbate estimation difficulties even with large sample sizes. Towards this aim, we propose a flexible and interpretable risk modeling approach based on semiparametric sufficient dimension reduction. The approach accounts for patient heterogeneity, borrows strength in estimation across related subpopulations to improve both estimation efficiency and interpretability, and can serve as a useful exploratory tool or as a powerful predictive model. In simulated examples, we show that our approach can improve estimation performance in the presence of heterogeneity and is quite robust to deviations from its key underlying assumption. We demonstrate the utility of our approach in the prediction of hospital admission risk for a large health system when tested on further follow-up data.

Tuesday, January 21, 2020 — 10:00 AM EST

Diagnostics for Regression Models with Discrete Outcomes

Making informed decisions about model adequacy has been an outstanding issue for regression models with discrete outcomes. Standard residuals such as Pearson and deviance residuals for such outcomes often show a large discrepancy from the hypothesized pattern even under the true model and are not informative especially when data are highly discrete. To fill this gap, we propose a surrogate empirical residual distribution function for general discrete (e.g. ordinal and count) outcomes that serves as an alternative to the empirical Cox-Snell residual distribution function. When at least one continuous covariate is available, we show asymptotically that the proposed function converges uniformly to the identity function under the correctly specified model, even with highly discrete (e.g. binary) outcomes. Through simulation studies, we demonstrate empirically that the proposed surrogate empirical residual distribution function is highly effective for various diagnostic tasks, since it is close to the hypothesized pattern under the true model and significantly departs from this pattern under model misspecification.

Wednesday, January 22, 2020 — 10:00 AM EST

The possibility of nearly assumption-free inference in causal inference

In causal effect estimation, the state-of-the-art is the so-called double machine learning (DML) estimators, which combine the benefit of doubly robust estimation, sample splitting and using machine learning methods to estimate nuisance parameters. The validity of the confidence interval associated with a DML estimator, in most part, relies on the complexity of nuisance parameters and how close the machine learning estimators are to the nuisance parameters. Before we have a complete understanding of the theory of many machine learning methods including deep neural networks, even a DML estimator may have a bias so large that prohibits valid inference. In this talk, we describe a nearly assumption-free procedure that can either criticize the invalidity of the Wald confidence interval associated with the DML estimators of some causal effect of interest or falsify the certificates (i.e. the mathematical conditions) that, if true, could ensure valid inference. Essentially, we are testing the null hypothesis that if the bias of an estimator is smaller than a fraction $\rho$ its standard error. Our test is valid under the null without requiring any complexity (smoothness or sparsity) assumptions on the nuisance parameters or the properties of machine learning estimators and may have power to inform the analysts that they have to do something else than DML estimators or Wald confidence intervals for inference purposes. This talk is based on joint work with Rajarshi Mukherjee and James M. Robins.

Thursday, January 30, 2020 — 10:00 AM EST

Batch-mode active learning for regression and its application to the valuation of large variable annuity portfolios

Supervised learning algorithms require a sufficient amount of labeled data to construct an accurate predictive model. In practice, collecting labeled data may be extremely time-consuming while unlabeled data can be easily accessed. In a situation where labeled data are insufficient for a prediction model to perform well and the budget for an additional data collection is limited, it is important to effectively select objects to be labeled based on whether they contribute to a great improvement in the model's performance. In this talk, I will focus on the idea of active learning that aims to train an accurate prediction model with minimum labeling cost. In particular, I will present batch-mode active learning for regression problems. Based on random forest, I will propose two effective random sampling algorithms that consider the prediction ambiguities and diversities of unlabeled objects as measures of their informativeness. Empirical results on an insurance data set demonstrate the effectiveness of the proposed approaches in valuing large variable annuity portfolios (which is a practical problem in the actuarial field). Additionally, comparisons with the existing framework that relies on a sequential combination of unsupervised and supervised learning algorithms are also investigated.

Wednesday, February 5, 2020 — 10:00 AM EST

Detecting the Signal Among Noise and Contamination in High Dimensions

Improvements in biomedical technology and a surge in other data-driven sciences lead to the collection of increasingly large amounts of data. In this affluence of data, contamination is ubiquitous but often neglected, creating substantial risk of spurious scientific discoveries. Especially in applications with high-dimensional data, for instance proteomic biomarker discovery, the impact of contamination on methods for variable selection and estimation can be profound yet difficult to diagnose.
In this talk I present a method for variable selection and estimation in high-dimensional linear regression models, leveraging the elastic-net penalty for complex data structures. The method is capable of harnessing the collected information even in the presence of arbitrary contamination in the response and the predictors. I showcase the method’s theoretical and practical advantages, specifically in applications with heavy-tailed errors and limited control over the data. I outline efficient algorithms to tackle computational challenges posed by inherently non-convex objective functions of robust estimators and practical strategies for hyper-parameter selection, ensuring scalability of the method and applicability to a wide range of problems.

Friday, February 21, 2020 — 10:30 AM EST

To be announced.

Friday, March 6, 2020 — 10:30 AM EST


Friday, March 20, 2020 — 10:30 AM EDT


Friday, April 3, 2020 — 10:30 AM EDT


  1. 2020 (27)
    1. June (1)
    2. May (3)
    3. April (2)
    4. March (2)
    5. February (5)
    6. January (14)
  2. 2019 (65)
    1. December (3)
    2. November (8)
    3. October (8)
    4. September (4)
    5. August (2)
    6. July (2)
    7. June (2)
    8. May (6)
    9. April (7)
    10. March (6)
    11. February (4)
    12. January (13)
  3. 2018 (44)
  4. 2017 (55)
  5. 2016 (44)
  6. 2015 (38)
  7. 2014 (44)
  8. 2013 (46)
  9. 2012 (44)