The datathin R package splits a random variable (or a vector or matrix of random variables) into independent training and test sets using the methodology introduced in Neufeld et al., 2023 (link to preprint).
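To give a feel for what "splitting a random variable" means, here is a minimal base R sketch of the well-known Poisson case. It does not call datathin itself. The variable names and the values of n, lambda, and epsilon are assumptions chosen for illustration. If X is Poisson with mean lambda and the training fold is drawn as Binomial(X, epsilon) given X, then the training and test folds are independent Poisson variables with means epsilon * lambda and (1 - epsilon) * lambda.

```r
# Poisson thinning sketch in base R (illustrative only; not the datathin API).
set.seed(1)

n <- 10000       # number of observations (assumed for illustration)
lambda <- 5      # Poisson mean (assumed for illustration)
epsilon <- 0.5   # share of information assigned to the training fold (assumed)

# Original data: one Poisson draw per observation.
x <- rpois(n, lambda)

# Conditional on x, draw the training fold by binomial thinning;
# the test fold is whatever remains.
x_train <- rbinom(n, size = x, prob = epsilon)
x_test <- x - x_train

# The folds always sum back to the original data.
stopifnot(all(x_train + x_test == x))

# Fold means should be close to epsilon * lambda and (1 - epsilon) * lambda,
# and the sample correlation between folds should be close to zero.
mean(x_train)
mean(x_test)
cor(x_train, x_test)
```

Because the two folds are independent, a model can be fit to x_train and evaluated on x_test without reusing the same information twice. The package tutorials cover how to do this with datathin for other distributions.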
Make sure that remotes is installed by running install.packages("remotes"), then type
remotes::install_github("anna-neufeld/datathin")
For now, you can check out our introductory tutorial, which gives basic examples of how to apply data thinning under a variety of distributional assumptions. You can also check out our unsupervised learning tutorial, which shows how data thinning can be applied to estimate the number of clusters in normally distributed data.
More tutorials for this package are coming soon! We will provide examples of how to use datathin for tasks such as model evaluation and inference after model selection under a variety of distributional assumptions.
To learn more, check out our preprint.
To reproduce the figures from our preprint, please see the following repository: https://github.com/anna-neufeld/datathin_paper.
The scRNA-seq dataset analyzed in our paper is freely available from 10x Genomics and is also included in the R package countsplit.