Veridical Data Science

Message from Dr. Yu:

Veridical (truthful) Data Science (VDS) is a new paradigm for data science built on a creative and grounded synthesis and expansion of best practices and ideas in machine learning and statistics. My team and I have developed it over the last decade. It is based on three fundamental principles of data science: predictability, computability, and stability (PCS). These principles integrate ML and statistics and significantly expand the traditional statistical notion of uncertainty, from sample-to-sample variability alone to uncertainties arising from data cleaning and algorithm choices, among other human judgment calls. VDS aims to meet the challenges of the reproducibility crisis and to arrive at responsible, reliable data analysis and decision-making by fully accounting for sources of uncertainty and insisting on reality checks throughout the data science life cycle.
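
To make the idea of expanded uncertainty concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the PCS procedure from the book or the papers: the synthetic data, the two imputation rules, the two slope estimators, and all function names are hypothetical, and the predictability screening step is omitted. It collects a slope estimate for every combination of cleaning choice, algorithm choice, and bootstrap sample, so the reported interval reflects judgment calls as well as sample-to-sample variability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y depends linearly on x; 20 x values are missing.
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
x_raw = x.copy()
x_raw[rng.choice(n, size=20, replace=False)] = np.nan

# Judgment call 1 (data cleaning): two reasonable imputation rules.
def impute_mean(v):
    return np.where(np.isnan(v), np.nanmean(v), v)

def impute_median(v):
    return np.where(np.isnan(v), np.nanmedian(v), v)

# Judgment call 2 (algorithm): two reasonable slope estimators.
def fit_ols(xs, ys):
    slope, _intercept = np.polyfit(xs, ys, 1)
    return float(slope)

def fit_ridge(xs, ys, lam=10.0):
    xc, yc = xs - xs.mean(), ys - ys.mean()
    return float(xc @ yc / (xc @ xc + lam))

def perturbed_slopes(x_raw, y, n_boot=100):
    """Slope estimates across cleaning choices, algorithms, and bootstrap samples."""
    estimates = []
    for clean in (impute_mean, impute_median):
        xs = clean(x_raw)
        for fit in (fit_ols, fit_ridge):
            for _ in range(n_boot):
                # Resample rows with replacement (sample-to-sample variability).
                idx = rng.integers(0, len(y), size=len(y))
                estimates.append(fit(xs[idx], y[idx]))
    return np.array(estimates)

est = perturbed_slopes(x_raw, y)
lo, hi = np.percentile(est, [2.5, 97.5])
print(f"slope interval across all perturbations: [{lo:.2f}, {hi:.2f}]")
```

Restricting the loops to a single cleaning rule and a single estimator would recover an ordinary bootstrap interval; comparing the two intervals gives a rough sense of how much the judgment calls contribute.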

My Veridical Data Science (VDS) book, written with my former student Rebecca Barter, was published by MIT Press in 2024 in their machine learning series, and a free online version is available at vdsbook.com. It is a very accessible book that aims to train critical thinking mainly through narratives (not much math) and case studies. It is designed for upper-division undergraduate and beginning graduate students and for domain experts alike.

Yuval and Yoav Benjamini wrote a very positive review of the book, published in Harvard Data Science Review. I am attaching some slides from my talk about the book. The book took us nine years from start to publication, with many versions written as the PCS framework for VDS evolved, starting in the early 2010s, through collaborative research projects in genomics, neuroscience, and precision medicine and through teaching Stats 215A at Berkeley (a first-year core PhD statistics course called Applied Stats and ML).

On the research front, here are seven recent PCS papers: 

1. HCM paper on cost-effective hypothesis generation for finding genetic drivers of the heart disease hypertrophic cardiomyopathy (HCM), in collaboration with the Ashley Group at Stanford Medical School (4 out of 5, or 80%, of the recommendations were confirmed by experiments).

2. Prostate cancer detection paper on stress-testing the stability of data cleaning and halving the number of genes needed for prostate cancer detection, with a large AUC improvement relative to the current clinical test, PSA (from 60% to 80%), in collaboration with the Chinnaiyan Group at the University of Michigan Medical School.

3. PCS-UQ paper -- an expansion of Ch. 13 of the VDS book, evaluated on 23 datasets and showing a 20% average size reduction over the best conformal method considered. One step in PCS-UQ is actually a new form of conformal method.

4. PCS workflow paper -- an updated and concise intro to PCS.

5. NESS paper -- a PCS-guided enhancement of t-SNE and UMAP.

6. Veridical data science and medical foundation models on arXiv by Alaa and Yu.

7. MERITS paper -- a PCS-guided primer for data-driven simulation design (to appear, JCGS, 2025)

VDS | PCS