monocle3_tutorial.Rmd
Before using this tutorial, we recommend that you read through our introductory tutorial to understand our method in a simple example with simulated data.
For this tutorial, we reproduce the Monocle3 vignette, but we apply count splitting. Throughout the tutorial, we see that we reach similar takeaways to those reached in the Monocle3 vignette, despite the fact that we estimate pseudotime using only the training set.
If you don’t already have Monocle3
, you will need to
install it. Please visit this
link for installation details. As Monocle3
is in Beta,
there may be some trouble shooting involved.
Next, you should load the package, along with others that we will use in this tutorial.
library(monocle3)
library(tidyverse)
library(countsplit)
We load the same data that is used in the Monocle3 vignette.
expression_matrix <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_expression.rds"))
cell_metadata <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_colData.rds"))
gene_annotation <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_rowData.rds"))
We now deviate from the Monocle3 vignette to apply count splitting.
We apply Poisson count splitting directly to the raw expression matrix.
We then construct a cell_data_set
object (used by the
Monocle3 package) that contains only the training set counts.
epsilon=0.5
set.seed(1)
#Xtrain <-apply(expression_matrix,2,function(u) rbinom(n=length(u), size=u, p=epsilon))
#Xtest <- expression_matrix-Xtrain
split.pcs <- countsplit(expression_matrix)
Xtrain <- split.pcs[[1]]
Xtest <- split.pcs[[2]]
cds_train <- new_cell_data_set(
Xtrain,
cell_metadata = cell_metadata,
gene_metadata = gene_annotation
)
This pipeline assumes that the data are reasonably close to following a Poisson distribution. In a separate tutorial, we will consider negative binomial count splitting.
We now follow the preprocessing and pseudotime steps from the Monocle3 tutorial to come up with pseudotime estimates. Everything in this section uses the training set only.
cds_train <- preprocess_cds(cds_train, num_dim = 50)
cds_train <- align_cds(
cds_train,
alignment_group = "batch",
residual_model_formula_str = "~ bg.300.loading + bg.400.loading + bg.500.1.loading +
bg.500.2.loading + bg.r17.loading + bg.b01.loading + bg.b02.loading"
)
The plot below shows that the cells cluster nicely by their cell type in reduced dimensions. This suggests that count splitting the data has not destroyed the signal in the data.
cds_train <- reduce_dimension(cds_train)
plot_cells(cds_train,
color_cells_by = "cell.type",
label_groups_by_cluster=FALSE,
label_cell_groups=FALSE,
graph_label_size=1.5
)
In this section, we use the training dataset to learn the graph that
will be used to estimate pseudotime. For simplicity in this tutorial, we
would like to only have one pseudotime trajectory with one root node.
Thus, we add the use_partition=FALSE
argument to the
learn_graph
function to ensure that our graph is made up of
a single connected trajectory. We use Louvain clustering, rather than
the default Leiden clustering of cluster_cells
, to avoid a
recent error with the Monocle3 package and the implementation of Leiden
clustering.
cds_train <- cluster_cells(cds_train, cluster_method="louvain")
cds_train <- learn_graph(cds_train, use_partition=FALSE)
Once the graph is learned, Monocle3 computes pseudotime with the
order_cells()
function. This function simply projects the
cells onto the principal graph to get a one dimensional representation.
The only additional step is picking a root node in the graph that will
correspond to 0
in pseudotime space. To pick a root node
for pseudotime, we use the same method as in the Monocle3 vignette.
We pick the cell from the first time bin that is closest to a graph
vertex.
get_earliest_principal_node <- function(cds, time_bin="130-170"){
cell_ids <- which(colData(cds)[, "embryo.time.bin"] == time_bin)
closest_vertex <-
cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
root_pr_nodes <-
igraph::V(principal_graph(cds)[["UMAP"]])$name[as.numeric(names
(which.max(table(closest_vertex[cell_ids,]))))]
root_pr_nodes
}
cds_train <- order_cells(cds_train, root_pr_nodes=get_earliest_principal_node(cds_train))
We now have estimated pseudotime, and we can plot our learned graph with cells colored by pseudotime.
plot_cells(cds_train,
color_cells_by = "pseudotime",
label_cell_groups=FALSE,
label_leaves=FALSE,
label_branch_points=FALSE,
graph_label_size=1.5)
Now that we have estimated pseudotime using the training set, we are ready to study differential expression along pseudotime. In this section, we rely on the pseudotime from the training set but we rely on the test set counts for differential expression analysis.
First, we carry out differential expression testing “by hand”. We fit
a quasipoisson GLM of each test set gene onto the pseudotime computed
from the training set. (We choose quasipoisson since this is the default
family for the fit_models()
function in Monocle3.) For
computational efficiency, we only do this for the first 500 genes. We
include the size factors estimated on the training as offsets in the
regression.
pseudotime.train <- pseudotime(cds_train)
sf.train <- colData(cds_train)$Size_Factor
results <- apply(Xtest[1:500,], 1, function(u) summary(glm(u~pseudotime.train, offset = log(sf.train), family="quasipoisson"))$coefficients[2,2])
head(results, n=10)
## WBGene00010957 WBGene00010958 WBGene00010959 WBGene00010960 WBGene00010961
## 0.001890245 0.006411465 0.002208787 0.002007567 0.004345681
## WBGene00000829 WBGene00010962 WBGene00010963 WBGene00010964 WBGene00010965
## 0.002515883 0.001990105 0.003296668 0.002181367 0.001858456
The results above show the differential expression p-value for each of the first 10 genes in the dataset.
This section follows the “finding genes that change as a function of
pseudotime” section
of the Monocle3 vignette. The differential expression testing method
recommended in the Monocle3 tutorial is graph_test()
, which
uses Moran’s I statistics to test if cells that are adjacent in
pseudotime space have similar expression values for certain genes.
In this section, we will need our test set to be stored in a
cell_data_set
that contains the test set counts. We need
this object to store the training set graph. To accomplish this, we make
a copy of cds_train
and only update the counts.
cds_test <- cds_train
counts(cds_test) <- Xtest
We now run the graph_test
function, which is the
preferred Monocle3 methodology for finding genes that vary across
pseudotime.
ciliated_cds_pr_test_res <- graph_test(cds_test, neighbor_graph="principal_graph", cores=4)
pr_deg_ids <- row.names(subset(ciliated_cds_pr_test_res, q_value < 0.0005))
Here are a couple of interesting genes that score as highly
significant according to graph_test()
, both in our analysis
and in the analysis in the Monocle3 vignette. We can see that the
expression levels of these genes do vary a lot in pseudotime space
(these genes are highly expressed in very specific regions of the
graph).
plot_cells(cds_test, genes=c("hlh-4", "gcy-8", "dac-1", "oig-8"),
show_trajectory_graph=FALSE,
label_cell_groups=FALSE,
label_leaves=FALSE)
We can also select specific genes and plot their expression across
pseudotime. As in the Monocle3 vignette, we see that all three of these
genes get “activated” midway through the trajectory, but
dac-1
is activated slightly before the others.
AFD_genes <- c("gcy-8", "dac-1", "oig-8")
AFD_lineage_cds <- cds_test[rowData(cds_test)$gene_short_name %in% AFD_genes,
colData(cds_test)$cell.type %in% c("AFD")]
plot_genes_in_pseudotime(AFD_lineage_cds, color_cells_by="embryo.time.bin",min_expr=0.5)
The analysis in this tutorial identifies differentially expressed genes without double dipping in the data, and thus improves upon the analysis in the Monocle3 vignette.