Suppose that we use a CART regression tree (Breiman et al. 1984), fit using the rpart package (Therneau and Atkinson 2019), to identify subgroups of patients in the Box Lunch Study (BLS) who share similar values of the response variable and are also described by common covariate values. The BLS dataset was originally provided in the visTree package (Venkatasubramaniam and Wolfson (2018)) and is described in detail in Venkatasubramaniam et al. (2017). The response variable is the number of calories consumed in 24 hours and the covariates include measures of eating habits and attitudes towards food. One possible CART tree fit to this dataset is shown below.

library(rpart)
library(rpart.plot)
data(blsdata, package="treevalues")
bls.tree <-rpart(kcal24h0~hunger+disinhibition+resteating+rrvfood+liking+wanting, model = TRUE, data = blsdata, cp=0.02)
rpart.plot(bls.tree, extra=1, nn=TRUE)

Suppose that we want to know whether the first split in the tree is statistically significant. In other words, we want to know if the average number of calories consumed by individuals with hunger < 11 is truly different than the average number of calories consumed by individuals with hunger >= 11. Because we used our data to select these two subgroups, we cannot naively “double dip” in the same data to test for a difference in means between the two subgroups. The package treevalues provides a way to test for this difference in means without double-dipping in the data. The package implements the selective inference framework described in Neufeld, Gao, and Witten (2021); for more details, see the preprint.

Apart from providing p-values for the difference in means across binary splits, the framework in treevalues also provides confidence intervals for the mean response within a given region. For more details on how to use the package to conduct inference, see the tutorial.

Make sure that devtools is installed by running install.packages("devtools"), then type

devtools::install_github("anna-neufeld/treevalues")

library(treevalues)
treeval.plot(bls.tree)

References

Breiman, Leo, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and Regression Trees. CRC press.

Neufeld, Anna, Lucy Gao, and Daniela Witten. 2021. “Tree-Values: Selective Inference for Regression Trees.” arXiv Preprint arXiv:2106.07816.

Therneau, Terry, and Beth Atkinson. 2019. Rpart: Recursive Partitioning and Regression Trees.

Venkatasubramaniam, Ashwini, and Julian Wolfson. 2018. visTree: Visualization of Subgroups for Decision Trees.

Venkatasubramaniam, Ashwini, Julian Wolfson, Nathan Mitchell, Timothy Barnes, Meghan JaKa, and Simone French. 2017. “Decision Trees in Epidemiological Research.” Emerging Themes in Epidemiology 14 (1): 11.

treevalues: Selective Inference for Regression Trees

References