Introductory tutorial • datathin

Welcome to the datathin package. In this tutorial, we will show how random variables from different distributions can be split into independent training and test components. For more details, see our pre-print. For information on how to use datathin for tasks such as model evaluation or inference after model selection, see our forthcoming tutorials.

To get started, ensure that you have downloaded the package. Make sure that remotes is installed by running install.packages("remotes"), then type

remotes::install_github("anna-neufeld/datathin")

library(datathin)
library(mvtnorm)
library(extraDistr)
library(ggplot2)
library(patchwork)

Poisson

The Poisson distribution is simple, because there is only one parameter.

We first generate a vector of 10,000 \(Poisson(7)\) random variables.

set.seed(1)
dat <- rpois(10000, 7)

Next, we thin the data with the datathin function. The output of datathin is an array where the last dimension indexes the folds of the thinned data.

dat.thin <- datathin(dat, family="poisson", epsilon=c(0.3, 0.7))
dat.train <- dat.thin[,,1]
dat.test <- dat.thin[,,2]

We now verify several properties of the data thinning operation.

First, we verify that the expected value of dat.train is \(0.3 \times 7\) and the expected value of dat.test is \(0.7 \times 7\).

mean(dat)
## [1] 7.0001
mean(dat.train)
## [1] 2.1028
0.3*7
## [1] 2.1
mean(dat.test)
## [1] 4.8973
0.7*7
## [1] 4.9

We next verify that dat.train and dat.test are independent, and that they sum to dat. Throughout this document, we use small empirical correlations as evidence of independence.

all.equal(dat, as.numeric(dat.train+dat.test))
## [1] TRUE
cor(dat.train, dat.test)
## [1] 0.006371561

Exponential

We next consider the exponential distribution.

set.seed(2)
dat <- rexp(100000, rate=1/5)
mean(dat)
## [1] 4.990168

For this distribution, we will create 5 independent, symmetric folds of the K argument in the datathin function.

folds <- datathin(dat, family="exponential", K=5)

Here, folds is a \(100,000 \times 1 \times 5\) array.

dim(folds)[3]
## [1] 5
length(folds[,,3])
## [1] 100000

We can verify that, for \(m=1,\ldots,5\), the \(m\)th fold of data is independent of the sum of the remaining folds, and that it has mean \(\frac{1}{5} \times 5 = 1\), while the remainder has mean \(\frac{4}{5} \times 5 = 4\). Once again, we use small empirical correlations as evidence of independence.

for (m in 1:5) {
  print("-------")
  print(paste("fold", m))
  
  dat.test <- folds[,,m]
  dat.train <- dat - folds[,,m]
  
  print(paste("Test set mean", round(mean(dat.test), 3)))
  print(paste("Training set mean:", round(mean(dat.train), 3)))
  print(paste("Sample correlation:", round(as.numeric(cor(dat.test, dat.train)),4)))
}
## [1] "-------"
## [1] "fold 1"
## [1] "Test set mean 0.985"
## [1] "Training set mean: 4.005"
## [1] "Sample correlation: 0.0062"
## [1] "-------"
## [1] "fold 2"
## [1] "Test set mean 1"
## [1] "Training set mean: 3.991"
## [1] "Sample correlation: 0.0063"
## [1] "-------"
## [1] "fold 3"
## [1] "Test set mean 0.999"
## [1] "Training set mean: 3.991"
## [1] "Sample correlation: -0.001"
## [1] "-------"
## [1] "fold 4"
## [1] "Test set mean 1.002"
## [1] "Training set mean: 3.988"
## [1] "Sample correlation: 7e-04"
## [1] "-------"
## [1] "fold 5"
## [1] "Test set mean 1.004"
## [1] "Training set mean: 3.986"
## [1] "Sample correlation: 0.0011"

Gamma

The results for the exponential family can be extended to any gamma distribution where the shape parameter is known. To thin the gamma distribution we will provide the shape to the datathin function using the arg argument. Note that by omitting the epsilon parameter, we are requesting symmetric training and test sets.

dat <- rgamma(10000, shape=12, rate=2)
mean(dat)
## [1] 5.992822
res <- datathin(dat, family="gamma", arg=12)
dat.train <- res[,,1]
dat.test <- res[,,2]
mean(dat.train)
## [1] 3.004014
mean(dat.test)
## [1] 2.988808
as.numeric(cor(dat.train, dat.test))
## [1] -0.01479233

Normal distribution

We now show an example of thinning the normal distribution. This is slightly more complicated, as the \(N(\mu, \sigma^2)\) distribution has two parameters. We will demonstrate data thinning strategies when either the mean \(\mu\) or the variance \(\sigma^2\) are unknown, and the other is known.

We start by generating data from a \(N(5, 2)\) distribution. This time, we let our dataset dat be a matrix with dimensions \(10000 \times 10\).

set.seed(3)
dat <- matrix(rnorm(10000*10, mean=5, sd=sqrt(2)), nrow=10000)

Thinning the normal distribution with unknown mean

We first apply data thinning assuming that \(\sigma^2=2\) is known. This is passed in as the arg parameter. We can see that, after applying data thinning, each column of the training set is independent of each column in the test set (once again, we are using small empirical correlations as evidence of independence).

res <- datathin(dat, family="normal", arg=2)
dat.train <- res[,,1]
dat.test <- res[,,2]

cors <- sapply(1:ncol(dat.train), function(u) round(cor(dat.train[,u], dat.test[,u]), 4))
print(paste("Correlation between train and test in column", 1:10, ":", cors))
##  [1] "Correlation between train and test in column 1 : -0.005" 
##  [2] "Correlation between train and test in column 2 : 0.0075" 
##  [3] "Correlation between train and test in column 3 : 0.0153" 
##  [4] "Correlation between train and test in column 4 : -0.0066"
##  [5] "Correlation between train and test in column 5 : 0.0138" 
##  [6] "Correlation between train and test in column 6 : -0.0038"
##  [7] "Correlation between train and test in column 7 : 0.0267" 
##  [8] "Correlation between train and test in column 8 : 0.019"  
##  [9] "Correlation between train and test in column 9 : -0.0013"
## [10] "Correlation between train and test in column 10 : -0.009"

The parameter arg must either be a scalar (in which case it is assumed that all elements of the data matrix have the same value of \(\sigma^2\)), or its dimensions must match that of the data.

We now explore the impact of using the “wrong” value of \(\sigma^2\) while splitting. We generate a dataset with \(100,000\) rows and \(3\) columns. In the first column, the data are from a \(N(5, 0.1)\) distribution. In the second column, they are from a \(N(5,2)\) distribution. In the final column, they are from a \(N(5,20)\) distribution.

dat <- cbind(rnorm(100000, mean=5, sd=sqrt(0.1)),
             rnorm(100000, mean=5, sd=sqrt(2)),
             rnorm(100000, mean=5, sd=sqrt(20)))

First, we datathin but we incorrectly assume that all items in the matrix have \(\sigma^2=2\). We see that we have negative correlation between the training and test sets in the first column (where we used a value of \(\sigma^2\) that was too big), no correlation in the second column (where we used the correct value of \(\sigma^2\)), and positive correlation in the last column (where we used a value of \(\sigma^2\) that was too small). See our preprint for further explanation of this property.

res <- datathin(dat, family="normal", arg=2)
dat.train <- res[,,1]
dat.test <- res[,,2]
cors <- sapply(1:ncol(dat.train), function(u) round(cor(dat.train[,u], dat.test[,u]), 4))
print(paste("Correlation between train and test in column", 1:3, ":", cors))
## [1] "Correlation between train and test in column 1 : -0.9053"
## [2] "Correlation between train and test in column 2 : 0.0078" 
## [3] "Correlation between train and test in column 3 : 0.8191"

To remedy the situation, we must let arg be a matrix that stores the correct \(\sigma^2\) values.

good_args <- cbind(rep(0.1, 100000), rep(2, 100000), rep(20,100000))
res <- datathin(dat, family="normal", arg=good_args)
dat.train <- res[,,1]
dat.test <- res[,,2]
cors <- sapply(1:ncol(dat.train), function(u) round(as.numeric(cor(dat.train[,u], dat.test[,u])),4))
print(paste("Correlation between train and test in column", 1:3, ":", cors))
## [1] "Correlation between train and test in column 1 : 9e-04" 
## [2] "Correlation between train and test in column 2 : 0.0066"
## [3] "Correlation between train and test in column 3 : 0.0051"

We quickly note that an error will be thrown if arg is not a scalar and its dimensions do not match those of the data. The following code results in an error:

res <- datathin(dat, family="normal", arg=c(0.1,2,20))

Thinning the normal distribution with unknown variance

We can also thin the normal distribution with unknown variance by recognizing that for \(X\sim N(\mu,\sigma^2)\), \((X-\mu)^2 \sim \sigma^2 \chi_1^2 = \text{Gamma}\left(\frac{1}{2},\frac{1}{2\sigma^2}\right)\). In other words, we can convert our normal data with knonw mean into gamma data with known shape, thus allowing us to apply gamma data thinning. This is an example of indirect thinning; see our second preprint for more details.

dat <- matrix(rnorm(10000*10, mean=5, sd=sqrt(2)), nrow=10000)
var(as.vector(dat))
## [1] 1.989115
res <- datathin(dat, family="normal-variance", arg=5)
dat.train <- res[,,1]
dat.test <- res[,,2]
mean(dat.train)
## [1] 0.9877177
mean(dat.test)
## [1] 1.001414
as.numeric(cor(as.vector(dat.train), as.vector(dat.test)))
## [1] 0.000388819

Negative binomial

We generate data from a negative binomial distribution. We use the mean and overdispersion parameterization of the negative binomial distribution.

dat <- rnbinom(100000, size=7, mu = 6)
res <- datathin(dat, family="negative binomial", epsilon = c(0.2,0.8), arg=7)
dat.train <- res[,,1]
dat.test <- res[,,2]
0.2*6
## [1] 1.2
mean(dat.train)
## [1] 1.19873
0.8*6
## [1] 4.8
mean(dat.test)
## [1] 4.80254
as.numeric(cor(dat.train, dat.test))
## [1] -0.001117663

Binomial

We verify the same properties as above, assuming that the size parameter is known. Note that, if we want to use \(\epsilon=0.5\), the size parameter of the binomial distribution must be an even number.

dat <- rbinom(10000, 16, 0.25)
mean(dat)
## [1] 3.98
res <- datathin(dat, family="binomial", arg=16)
dat.train <- res[,,1]
dat.test <- res[,,2]
mean(dat.train)
## [1] 2.0008
mean(dat.test)
## [1] 1.9792
as.numeric(cor(dat.train, dat.test))
## [1] -0.009440734

Additional gamma families

The gamma distributions is extremely flexible, and shows up in many thinning recipes. Indeed, we saw this above when we thinned the normal distribution with unknown variance using the gamma distribution. Here we showcase similar strategies that also use the gamma distribution in some way; see our second preprint for further details.

Chi-squared

The chi-squared distribution is a special case of the gamma distribution, where the shape is equal to half of the degrees of freedom. Using this relationship, we can thin a scaled chi-squared distribution with \(K\) degrees of freedom into \(K\) normal random variables.

dat <- 3*rchisq(10000, df=5)
mean(dat)/5
## [1] 2.997949
res <- datathin(dat, family="chi-squared", K=5)
dat1 <- res[,,1]
dat2 <- res[,,2]
var(dat1)
## [1] 3.025864
var(dat2)
## [1] 2.983546
as.numeric(cor(dat1, dat2))
## [1] -0.009146341

Gamma-Weibull

Similarly, we can thin the gamma distribution into the Weibull distribution.

dat <- rgamma(10000, shape=4, rate=4)
res <- datathin(dat, family="gamma-weibull", K=4, arg=3)
dat1 <- res[,,1]
dat2 <- res[,,2]
print(paste("Expected mean:", (4^(-1/3))*gamma(1+1/3)))
## [1] "Expected mean: 0.56254184187547"
mean(dat1)
## [1] 0.5616192
mean(dat2)
## [1] 0.564796
as.numeric(cor(dat1, dat2))
## [1] -0.01063155

Weibull

We can also thin the Weibull distribution with known shape into gamma random variables.

dat <- rweibull(100000, 5, 2)
res <- datathin(dat, family="weibull", K=2, arg=5)
dat1 <- res[,,1]
dat2 <- res[,,2]
print(paste("Expected mean:", (1/2)*(2^(5))))
## [1] "Expected mean: 16"
mean(dat1)
## [1] 15.93477
mean(dat2)
## [1] 15.98874
as.numeric(cor(dat1, dat2))
## [1] 0.00400066

Pareto

Similarly, we can thin the Pareto distribution with known scale into gamma random variables.

dat <- rpareto(100000, 6, 2)
res <- datathin(dat, family="pareto", K=2, arg=2)
dat1 <- res[,,1]
dat2 <- res[,,2]
print(paste("Expected mean:", (1/2)/6))
## [1] "Expected mean: 0.0833333333333333"
mean(dat1)
## [1] 0.08343128
mean(dat2)
## [1] 0.08318186
as.numeric(cor(dat1, dat2))
## [1] -0.001256507

Multivariate distributions

Data thinning recipes are also available for a subset of multivariate distributions. In both of the following distributions, care must be taken when providing the known parameter.

Multivariate normal

As the multivariate normal distribution is vector-valued, each row of the \(n \times p\) data matrix corresponds to one observation. To thin this data, we must provide the covariance matrix to the datathin function. Either arg can be a single \(p \times p\) covariance matrix, or a \(n \times p \times p\) array of covariance matrices.

rho <- 0.6
Sig <- matrix(0, nrow=4, ncol=4)
for (i in 1:4) {
  for (j in 1:4) {
    Sig[i,j] <- rho^abs(i-j)
  }
}
dat <- rmvnorm(100000, c(1,2,3,4), Sig)
colMeans(dat)
## [1] 0.9955463 1.9997193 2.9981161 3.9996550
res <- datathin(dat, family="mvnormal", arg=Sig)
dat.train <- res[,,1]
dat.test <- res[,,2]
colMeans(dat.train)
## [1] 0.499857 1.000230 1.496200 1.999099
colMeans(dat.test)
## [1] 0.4956893 0.9994894 1.5019156 2.0005555
max(abs(cor(dat.train, dat.test)))
## [1] 0.005323221

Multinomial

For the multinomial distribution, each row of the \(n\times p\) data matrix represents a realization from the multinomial distribution. The input to arg is then either a scalar, or a vector of length \(n\). Similar to the binomial distribution, the symmetric folds require an even number of trials.

dat <- t(rmultinom(100000, 20, c(0.1,0.2,0.3,0.4)))
colMeans(dat)
## [1] 1.99914 4.00629 5.99685 7.99772
res <- datathin(dat, family="multinomial", arg=20)
dat.train <- res[,,1]
dat.test <- res[,,2]
colMeans(dat.train)
## [1] 0.99864 2.00189 2.99914 4.00033
colMeans(dat.test)
## [1] 1.00050 2.00440 2.99771 3.99739
max(abs(cor(dat.train, dat.test)))
## [1] 0.003064689

Shift and scale families

Outside of the exponential family, certain shift and scale families can also be thinned.

Scaled beta

When \(X\sim \theta \text{Beta}(\alpha,1)\), \(X\) can be thinned so long as \(\alpha\) is known. If \(\alpha=1\), then this reduces to the simpler setting of thinning \(X\sim\text{Uniform}(0,\theta)\).

dat <- 10*rbeta(100000, 7, 1)
res <- datathin(dat, family="scaled-beta", K=2, arg=7)
dat1 <- res[,,1]
dat2 <- res[,,2]
print(paste("Expected mean:", 10*(7/2)/(7/2 + 1)))
## [1] "Expected mean: 7.77777777777778"
mean(dat1)
## [1] 7.77505
mean(dat2)
## [1] 7.780858
as.numeric(cor(dat1, dat2))
## [1] 0.003765505

Shifted exponential

We can also thin a shifted exponential, assuming that the rate is known.

dat <- 6 + rexp(100000, 2)
res <- datathin(dat, family="shifted-exponential", K=2, arg=2)
dat1 <- res[,,1]
dat2 <- res[,,2]
print(paste("Expected mean:", 6 + 1))
## [1] "Expected mean: 7"
mean(dat1)
## [1] 6.993169
mean(dat2)
## [1] 7.004156
as.numeric(cor(dat1, dat2))
## [1] 0.0007561852