Primary Research Interests
Classical statistical tools are designed for testing pre-specified hypotheses about pre-specified models. In the real world, data analysis is an adaptive process that involves exploring the data, fitting several models, evaluating these models to select the best one, and then testing hypotheses about this selected model. I refer to the practice of using the same data for multiple tasks along this exploratory pipeline as "double dipping". When classical statistical tools are applied without care in contexts that involve double dipping, the conclusions may be erroneous. Motivated by the gap between classical statistical tools and practical data analysis, my research program focuses on enabling scientists to safely draw conclusions from data in realistic settings where models and hypotheses are not pre-specified.
There are two primary ways to avoid the pitfalls associated with double dipping.
- Account for double dipping by developing specialized statistical procedures that explicitly adjust for the double use of data.
- Avoid double dipping by splitting the data into independent training and test sets, so that each task uses a different set. While this is typically accomplished via sample splitting, there are settings in which sample splitting is not an option and alternatives are needed.
My recent projects have focused either on developing specialized procedures that account for double dipping (the first approach above) or on developing alternatives to sample splitting that allow us to avoid double dipping (the second approach above). Publications and talks related to these projects are linked below. For a relatively recent overview, see the slides from my dissertation defense.
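To make the problem concrete, here is a minimal simulation sketch of double dipping versus sample splitting. It is an illustration written for this page, not code from any of the papers listed below, and the setup is an assumption chosen for simplicity: pure-noise Gaussian data with known unit variance, 20 candidate features, and a z-test on whichever feature looks most promising. The function names `z_test_p_value` and `simulate` are hypothetical.

```python
import math
import random
import statistics


def z_test_p_value(sample):
    """Two-sided p-value for H0: mean = 0, assuming unit-variance Gaussian data."""
    z = statistics.fmean(sample) * math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_reps=1000, n_obs=50, n_features=20, alpha=0.05, seed=0):
    """Estimate false-positive rates under a global null (every true mean is 0)."""
    rng = random.Random(seed)
    half = n_obs // 2
    naive_rejections = 0
    split_rejections = 0
    for _ in range(n_reps):
        # Pure noise, so any rejection below is a false positive.
        data = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)]
                for _ in range(n_features)]

        # Double dipping: pick the feature with the largest |mean| using
        # all observations, then test it on those same observations.
        chosen = max(data, key=lambda col: abs(statistics.fmean(col)))
        naive_rejections += z_test_p_value(chosen) < alpha

        # Sample splitting: pick the feature using the first half only,
        # then test it on the held-out second half.
        chosen = max(data, key=lambda col: abs(statistics.fmean(col[:half])))
        split_rejections += z_test_p_value(chosen[half:]) < alpha

    return naive_rejections / n_reps, split_rejections / n_reps


if __name__ == "__main__":
    naive_rate, split_rate = simulate()
    print(f"double dipping:  false-positive rate = {naive_rate:.3f}")
    print(f"sample splitting: false-positive rate = {split_rate:.3f}")
```

Under these assumptions, the double-dipping rate should land far above the nominal 5% level (with 20 independent features, the chance that at least one clears the threshold is 1 - 0.95^20, about 0.64), while the sample-splitting rate should stay close to 5%. Splitting pays for this with power, since only half of the observations are used for testing.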
Featured publications or preprints
- Ameer Dharamshi, Anna Neufeld, Lucy Gao, Daniela Witten, and Jacob Bien (2024+) Generalized data thinning using sufficient statistics. Journal of the American Statistical Association.
[paper]
- Anna Neufeld, Ameer Dharamshi, Lucy Gao, and Daniela Witten (2024) Data thinning for convolution-closed distributions. Journal of Machine Learning Research 25 (57), 1-35.
[paper]
[website]
[R package]
[simulation code]
- Anna Neufeld, Lucy L. Gao, Joshua Popp, Alexis Battle, and Daniela Witten (2024) Inference after latent variable estimation for single-cell RNA sequencing data. Biostatistics 25 (1), 270-287.
[paper]
[simulation code]
[package]
[package website]
[package tutorials]
- Anna Neufeld, Joshua Popp, Lucy L. Gao, Alexis Battle, and Daniela Witten (2024+) Negative binomial count splitting for single-cell RNA sequencing data. [preprint]
- For more information, see the talk I gave at the UW Combi Seminar in January 2024.
- Anna Neufeld, Lucy L. Gao, and Daniela Witten (2022) Tree-values: selective inference for regression trees. Journal of Machine Learning Research 23 (1), 13759-13801.
[paper]
[website]
[R package]
Additional publications or preprints
- Anna Neufeld and Daniela Witten (2021) Discussion of Breiman's "Two Cultures": From Two Cultures to One. Observational Studies 7 (1), 171-174.
[paper]
- Maxian, O.*, Anna Neufeld*, Talis, E. J.*, Childs, L. M., & Blackwood, J. C. (2017). Zika virus dynamics: When does sexual transmission matter? Epidemics, 21, 48-55. (* denotes equal contribution.)
[paper]