Thursday, October 17, 2019 — 4:00 PM EDT
Building Deep Statistical Thinking for Data Science 2020: Privacy Protected Census, Gerrymandering, and Election
The year 2020 will be a busy one for statisticians and, more generally, data scientists. The US Census Bureau has announced that the data from the 2020 Census will be released under differential privacy (DP) protection, which in layperson’s terms means adding some noise to the data. While few would argue against protecting data privacy, many researchers, especially from the social sciences, are concerned about whether the right trade-offs between data privacy and data utility are being made. DP protection also has a direct impact on redistricting, an issue that is already complicated enough with accurate counts because of the need to guard against excessive gerrymandering. The central statistical problem there is an unusual one: how to determine whether a realization is an outlier with respect to a null distribution, when that null distribution itself cannot be fully determined? The 2020 US election will be another highly watched event, with many groups already busy making predictions. Will the lessons from predicting the 2016 US election be learned, or will the failure be repeated? This talk invites the audience on a journey of deep statistical thinking prompted by these questions, regardless of whether they have any interest in the US Census or politics.
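As a rough illustration of what "adding some noise" can mean, the sketch below applies the textbook Laplace mechanism to a list of counts. It is an assumption-laden toy, not the Census Bureau's production algorithm: the function names, the sensitivity of 1, the epsilon values, and the block counts are all made up for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Return noisy counts under the textbook Laplace mechanism.

    epsilon is the privacy-loss budget (smaller means more noise);
    sensitivity is how much one person can change a count (assumed 1).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


if __name__ == "__main__":
    block_counts = [120, 45, 3, 0]  # hypothetical counts, not real data
    for eps in (0.1, 1.0, 10.0):
        noisy = privatize_counts(block_counts, eps, seed=0)
        print(eps, [round(x, 1) for x in noisy])
```

Running it with several epsilon values makes the privacy-utility trade-off concrete: a small epsilon swamps the small counts with noise, while a large epsilon leaves them nearly unchanged.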
Friday, October 11, 2019 — 10:30 AM EDT
Precision Factor Investing: Avoiding Factor Traps by Predicting Heterogeneous Effects of Firm Characteristics
We apply ideas from causal inference and machine learning to estimate the sensitivity of future stock returns to observable characteristics like size, value, and momentum. By analogy with the informal notion of a "value trap," we distinguish "characteristic traps" (stocks with weak sensitivity) from "characteristic responders" (those with strong sensitivity). We classify stocks by interpreting these distinctions as heterogeneous treatment effects (HTE), with characteristics interpreted as treatments and future returns interpreted as responses. The classification exploits a large set of stock features and recent work applying machine learning to HTE. Long-short strategies based on sorting stocks on characteristics perform significantly better when applied to characteristic responders than when applied to traps. A strategy based on the difference between these long-short returns profits from the predictability of HTE rather than from factors associated with the characteristics themselves. This is joint work with Pu He.
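The sorting step described above can be sketched in a few lines. This is a toy under stated assumptions, not the authors' method: the responder/trap labels are taken as given (the HTE classification itself is not reproduced), a higher characteristic is assumed to predict a higher return, and the function names, the 20% quantile, and the numbers are hypothetical.

```python
from statistics import mean


def long_short_return(stocks, quantile=0.2):
    """Long the top quantile by characteristic, short the bottom quantile.

    stocks: list of (characteristic, next_period_return) pairs.
    Returns the equal-weighted long-leg return minus the short-leg return.
    """
    if len(stocks) < 2:
        raise ValueError("need at least two stocks")
    ranked = sorted(stocks, key=lambda s: s[0])
    k = max(1, int(len(ranked) * quantile))
    short_leg = [r for _, r in ranked[:k]]
    long_leg = [r for _, r in ranked[-k:]]
    return mean(long_leg) - mean(short_leg)


def responder_minus_trap(responders, traps, quantile=0.2):
    # Difference between the long-short return earned among responders
    # and the long-short return earned among traps.
    return (long_short_return(responders, quantile)
            - long_short_return(traps, quantile))


if __name__ == "__main__":
    # Made-up numbers: returns rise with the characteristic for
    # responders and are flat for traps.
    responders = [(1, 0.01), (2, 0.02), (3, 0.03), (4, 0.04), (5, 0.05)]
    traps = [(1, 0.02), (2, 0.02), (3, 0.02), (4, 0.02), (5, 0.02)]
    print(long_short_return(responders))
    print(long_short_return(traps))
    print(responder_minus_trap(responders, traps))
```

In this toy, the sort earns a positive spread only among responders, so the responder-minus-trap difference isolates the part of the return that comes from knowing which stocks respond.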