Welcome to the datathin package. In this tutorial, we will show how random variables from different distributions can be split into independent training and test components. For more details, see our preprint. For information on how to use datathin for tasks such as model evaluation or inference after model selection, see our forthcoming tutorials.
To get started, ensure that you have installed the package. Make sure that remotes is installed by running install.packages("remotes"), and then install datathin with:
remotes::install_github("anna-neufeld/datathin")
The Poisson distribution is simple, because there is only one parameter.
We first generate a vector of 10,000 \(Poisson(7)\) random variables.
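The data-generation step is shown below as a minimal sketch, assuming that the package has been loaded; the seed used to produce the output in this tutorial is not shown, so exact numbers will differ slightly.
library(datathin)
# Sketch: 10,000 Poisson(7) observations.
dat <- rpois(10000, lambda=7)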
Next, we thin the data with the datathin function. The output of datathin is an array where the last dimension indexes the folds of the thinned data.
dat.thin <- datathin(dat, family="poisson", epsilon=c(0.3, 0.7))
dat.train <- dat.thin[,,1]
dat.test <- dat.thin[,,2]
We now verify several properties of the data thinning operation.
First, we verify that the expected value of dat.train is \(0.3 \times 7\) and the expected value of dat.test is \(0.7 \times 7\).
mean(dat)
## [1] 7.0001
mean(dat.train)
## [1] 2.1028
0.3*7
## [1] 2.1
mean(dat.test)
## [1] 4.8973
0.7*7
## [1] 4.9
We next verify that dat.train and dat.test are independent, and that they sum to dat. Throughout this document, we use small empirical correlations as evidence of independence.
all.equal(dat, as.numeric(dat.train+dat.test))
## [1] TRUE
cor(dat.train, dat.test)
## [1] 0.006371561
We next consider the exponential distribution. For this distribution, we will create 5 independent, symmetric folds of the data using the K argument in the datathin function.
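The generation of dat is not shown above; based on the fold means reported below, a minimal sketch of this step (assuming exponential data with mean 5, and with the original seed unknown) is:
# Sketch: 100,000 exponential observations with mean 5 (rate 1/5).
dat <- rexp(100000, rate=1/5)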
folds <- datathin(dat, family="exponential", K=5)
Here, folds is a \(100,000 \times 1 \times 5\) array.
We can verify that, for \(m=1,\ldots,5\), the \(m\)th fold of data is independent of the sum of the remaining folds, and that it has mean \(\frac{1}{5} \times 5 = 1\), while the remainder has mean \(\frac{4}{5} \times 5 = 4\). Once again, we use small empirical correlations as evidence of independence.
for (m in 1:5) {
  print("-------")
  print(paste("fold", m))
  dat.test <- folds[,,m]
  dat.train <- dat - folds[,,m]
  print(paste("Test set mean", round(mean(dat.test), 3)))
  print(paste("Training set mean:", round(mean(dat.train), 3)))
  print(paste("Sample correlation:", round(as.numeric(cor(dat.test, dat.train)), 4)))
}
## [1] "-------"
## [1] "fold 1"
## [1] "Test set mean 0.985"
## [1] "Training set mean: 4.005"
## [1] "Sample correlation: 0.0062"
## [1] "-------"
## [1] "fold 2"
## [1] "Test set mean 1"
## [1] "Training set mean: 3.991"
## [1] "Sample correlation: 0.0063"
## [1] "-------"
## [1] "fold 3"
## [1] "Test set mean 0.999"
## [1] "Training set mean: 3.991"
## [1] "Sample correlation: -0.001"
## [1] "-------"
## [1] "fold 4"
## [1] "Test set mean 1.002"
## [1] "Training set mean: 3.988"
## [1] "Sample correlation: 7e-04"
## [1] "-------"
## [1] "fold 5"
## [1] "Test set mean 1.004"
## [1] "Training set mean: 3.986"
## [1] "Sample correlation: 0.0011"
The results for the exponential distribution can be extended to any gamma distribution where the shape parameter is known. To thin gamma data, we provide the shape to the datathin function using the arg argument. Note that by omitting the epsilon argument, we are requesting symmetric training and test sets.
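A minimal sketch of this recipe is given below, using hypothetical parameter values (shape 12, rate 2) and assuming that the gamma recipe is exposed as family="gamma" with arg giving the known shape.
# Hypothetical example: gamma data with known shape 12.
dat <- rgamma(100000, shape=12, rate=2)
# Omitting epsilon requests symmetric (50/50) training and test folds (assumed interface).
res <- datathin(dat, family="gamma", arg=12)
dat.train <- res[,,1]
dat.test <- res[,,2]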
We now show an example of thinning the normal distribution. This is slightly more complicated, as the \(N(\mu, \sigma^2)\) distribution has two parameters. We will demonstrate data thinning strategies for the case where either the mean \(\mu\) or the variance \(\sigma^2\) is unknown and the other is known.
We start by generating data from a \(N(5, 2)\) distribution. This time, we let our dataset dat be a matrix with dimensions \(10000 \times 10\).
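The generation step is not displayed above; a sketch that matches this description (and the code used later in the unknown-variance example) is:
# Sketch: a 10,000 x 10 matrix of N(5, 2) observations.
dat <- matrix(rnorm(10000*10, mean=5, sd=sqrt(2)), nrow=10000)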
We first apply data thinning assuming that \(\sigma^2=2\) is known; this value is passed in as the arg parameter. We can see that, after applying data thinning, each column of the training set is independent of each column of the test set (once again, we are using small empirical correlations as evidence of independence).
res <- datathin(dat, family="normal", arg=2)
dat.train <- res[,,1]
dat.test <- res[,,2]
cors <- sapply(1:ncol(dat.train), function(u) round(cor(dat.train[,u], dat.test[,u]), 4))
print(paste("Correlation between train and test in column", 1:10, ":", cors))
## [1] "Correlation between train and test in column 1 : -0.005"
## [2] "Correlation between train and test in column 2 : 0.0075"
## [3] "Correlation between train and test in column 3 : 0.0153"
## [4] "Correlation between train and test in column 4 : -0.0066"
## [5] "Correlation between train and test in column 5 : 0.0138"
## [6] "Correlation between train and test in column 6 : -0.0038"
## [7] "Correlation between train and test in column 7 : 0.0267"
## [8] "Correlation between train and test in column 8 : 0.019"
## [9] "Correlation between train and test in column 9 : -0.0013"
## [10] "Correlation between train and test in column 10 : -0.009"
The parameter arg must either be a scalar (in which case it is assumed that all elements of the data matrix have the same value of \(\sigma^2\)), or its dimensions must match those of the data.
We now explore the impact of using the “wrong” value of \(\sigma^2\) while splitting. We generate a dataset with \(100,000\) rows and \(3\) columns. In the first column, the data are from a \(N(5, 0.1)\) distribution. In the second column, they are from a \(N(5,2)\) distribution. In the final column, they are from a \(N(5,20)\) distribution.
dat <- cbind(rnorm(100000, mean=5, sd=sqrt(0.1)),
             rnorm(100000, mean=5, sd=sqrt(2)),
             rnorm(100000, mean=5, sd=sqrt(20)))
First, we thin the data while incorrectly assuming that all entries of the matrix have \(\sigma^2=2\). We see that there is negative correlation between the training and test sets in the first column (where the value of \(\sigma^2\) that we used was too big), no correlation in the second column (where we used the correct value of \(\sigma^2\)), and positive correlation in the last column (where the value of \(\sigma^2\) that we used was too small). See our preprint for further explanation of this property.
res <- datathin(dat, family="normal", arg=2)
dat.train <- res[,,1]
dat.test <- res[,,2]
cors <- sapply(1:ncol(dat.train), function(u) round(cor(dat.train[,u], dat.test[,u]), 4))
print(paste("Correlation between train and test in column", 1:3, ":", cors))
## [1] "Correlation between train and test in column 1 : -0.9053"
## [2] "Correlation between train and test in column 2 : 0.0078"
## [3] "Correlation between train and test in column 3 : 0.8191"
To remedy the situation, we must let arg be a matrix that stores the correct \(\sigma^2\) values.
good_args <- cbind(rep(0.1, 100000), rep(2, 100000), rep(20,100000))
res <- datathin(dat, family="normal", arg=good_args)
dat.train <- res[,,1]
dat.test <- res[,,2]
cors <- sapply(1:ncol(dat.train), function(u) round(as.numeric(cor(dat.train[,u], dat.test[,u])),4))
print(paste("Correlation between train and test in column", 1:3, ":", cors))
## [1] "Correlation between train and test in column 1 : 9e-04"
## [2] "Correlation between train and test in column 2 : 0.0066"
## [3] "Correlation between train and test in column 3 : 0.0051"
We quickly note that an error will be thrown if arg is not a scalar and its dimensions do not match those of the data. The following code results in an error:
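For example, a call along the following lines, which passes a length-3 vector of variances for the \(100,000 \times 3\) matrix rather than a scalar or a full matrix, should fail (hypothetical illustration).
# Hypothetical example of a mis-specified arg: neither a scalar nor the same dimensions as dat.
datathin(dat, family="normal", arg=c(0.1, 2, 20))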
We can also thin the normal distribution with unknown variance by recognizing that for \(X\sim N(\mu,\sigma^2)\), \((X-\mu)^2 \sim \sigma^2 \chi_1^2 = \text{Gamma}\left(\frac{1}{2},\frac{1}{2\sigma^2}\right)\). In other words, we can convert our normal data with known mean into gamma data with known shape, thus allowing us to apply gamma data thinning. This is an example of indirect thinning; see our second preprint for more details.
dat <- matrix(rnorm(10000*10, mean=5, sd=sqrt(2)), nrow=10000)
var(as.vector(dat))
## [1] 1.989115
res <- datathin(dat, family="normal-variance", arg=5)
dat.train <- res[,,1]
dat.test <- res[,,2]
mean(dat.train)
## [1] 0.9877177
mean(dat.test)
## [1] 1.001414
as.numeric(cor(as.vector(dat.train), as.vector(dat.test)))
## [1] 0.000388819
We next generate data from a negative binomial distribution, using the mean and overdispersion parameterization of the negative binomial distribution.
dat <- rnbinom(100000, size=7, mu = 6)
res <- datathin(dat, family="negative binomial", epsilon = c(0.2,0.8), arg=7)
dat.train <- res[,,1]
dat.test <- res[,,2]
0.2*6
## [1] 1.2
mean(dat.train)
## [1] 1.19873
0.8*6
## [1] 4.8
mean(dat.test)
## [1] 4.80254
as.numeric(cor(dat.train, dat.test))
## [1] -0.001117663
The same properties can be verified for the binomial distribution, assuming that the size parameter is known. Note that, if we want to use \(\epsilon=0.5\), the size parameter of the binomial distribution must be an even number.
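A minimal sketch of binomial thinning is given below, with hypothetical values (a known size of 16 and success probability 0.25), and assuming that the recipe is exposed as family="binomial" with arg giving the number of trials.
# Hypothetical example: binomial data with a known, even size parameter.
dat <- rbinom(100000, size=16, prob=0.25)
# Symmetric folds (epsilon = 0.5) require an even size, so each fold receives an integer number of trials.
res <- datathin(dat, family="binomial", arg=16)
dat.train <- res[,,1]
dat.test <- res[,,2]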
The gamma distribution is extremely flexible, and shows up in many thinning recipes. Indeed, we saw this above when we thinned the normal distribution with unknown variance using the gamma distribution. Here we showcase similar strategies that also use the gamma distribution in some way; see our second preprint for further details.
The chi-squared distribution is a special case of the gamma distribution, where the shape is equal to half of the degrees of freedom. Using this relationship, we can thin a scaled chi-squared distribution with \(K\) degrees of freedom into \(K\) normal random variables.
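A sketch of this recipe is given below, with hypothetical values (\(\sigma^2 = 2\) and \(K = 4\) degrees of freedom) and assuming that it is exposed under a family name such as "chi-squared"; consult the package documentation for the exact interface.
# Hypothetical example: scaled chi-squared data with 4 degrees of freedom.
dat <- 2 * rchisq(100000, df=4)
# Thin into K = 4 folds, matching the degrees of freedom (assumed family name and interface).
res <- datathin(dat, family="chi-squared", K=4)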
Similarly, we can thin the gamma distribution into the Weibull distribution.
dat <- rgamma(10000, shape=4, rate=4)
res <- datathin(dat, family="gamma-weibull", K=4, arg=3)
dat1 <- res[,,1]
dat2 <- res[,,2]
print(paste("Expected mean:", (4^(-1/3))*gamma(1+1/3)))
## [1] "Expected mean: 0.56254184187547"
mean(dat1)
## [1] 0.5616192
mean(dat2)
## [1] 0.564796
as.numeric(cor(dat1, dat2))
## [1] -0.01063155
Similarly, we can thin the Pareto distribution with known scale into gamma random variables.
# rpareto() is not in base R; it is available from, e.g., the extraDistr package.
dat <- rpareto(100000, 6, 2)
res <- datathin(dat, family="pareto", K=2, arg=2)
dat1 <- res[,,1]
dat2 <- res[,,2]
print(paste("Expected mean:", (1/2)/6))
## [1] "Expected mean: 0.0833333333333333"
mean(dat1)
## [1] 0.08343128
mean(dat2)
## [1] 0.08318186
as.numeric(cor(dat1, dat2))
## [1] -0.001256507
Data thinning recipes are also available for a subset of multivariate distributions. In both of the following distributions, care must be taken when providing the known parameter.
As the multivariate normal distribution is vector-valued, each row of the \(n \times p\) data matrix corresponds to one observation. To thin these data, we must provide the covariance matrix to the datathin function. The argument arg can either be a single \(p \times p\) covariance matrix or an \(n \times p \times p\) array of covariance matrices.
rho <- 0.6
Sig <- matrix(0, nrow=4, ncol=4)
for (i in 1:4) {
  for (j in 1:4) {
    Sig[i,j] <- rho^abs(i-j)
  }
}
# rmvnorm() is available from, e.g., the mvtnorm package.
dat <- rmvnorm(100000, c(1,2,3,4), Sig)
colMeans(dat)
## [1] 0.9955463 1.9997193 2.9981161 3.9996550
res <- datathin(dat, family="mvnormal", arg=Sig)
dat.train <- res[,,1]
dat.test <- res[,,2]
colMeans(dat.train)
## [1] 0.499857 1.000230 1.496200 1.999099
colMeans(dat.test)
## [1] 0.4956893 0.9994894 1.5019156 2.0005555
max(abs(cor(dat.train, dat.test)))
## [1] 0.005323221
For the multinomial distribution, each row of the \(n \times p\) data matrix represents a realization from the multinomial distribution. The input to arg is then either a scalar or a vector of length \(n\). As with the binomial distribution, symmetric folds require an even number of trials.
dat <- t(rmultinom(100000, 20, c(0.1,0.2,0.3,0.4)))
colMeans(dat)
## [1] 1.99914 4.00629 5.99685 7.99772
res <- datathin(dat, family="multinomial", arg=20)
dat.train <- res[,,1]
dat.test <- res[,,2]
colMeans(dat.train)
## [1] 0.99864 2.00189 2.99914 4.00033
colMeans(dat.test)
## [1] 1.00050 2.00440 2.99771 3.99739
max(abs(cor(dat.train, dat.test)))
## [1] 0.003064689
Outside of the exponential family, certain shift and scale families can also be thinned.
When \(X\sim \theta \text{Beta}(\alpha,1)\), \(X\) can be thinned so long as \(\alpha\) is known. If \(\alpha=1\), then this reduces to the simpler setting of thinning \(X\sim\text{Uniform}(0,\theta)\).
dat <- 10*rbeta(100000, 7, 1)
res <- datathin(dat, family="scaled-beta", K=2, arg=7)
dat1 <- res[,,1]
dat2 <- res[,,2]
print(paste("Expected mean:", 10*(7/2)/(7/2 + 1)))
## [1] "Expected mean: 7.77777777777778"
mean(dat1)
## [1] 7.77505
mean(dat2)
## [1] 7.780858
as.numeric(cor(dat1, dat2))
## [1] 0.003765505