`scran_tutorial.Rmd`

Before using this tutorial, we recommend that you read through our introductory tutorial to understand our method in a simple example with simulated data.

In this tutorial, we use a real dataset from Elorbany et al. (2022). The dataset contains 10,000 cells collected over 15 days of a directed differentiation protocol from induced pluripotent stem cells (IPSC) to cardiomyocytes (cm).

After loading the necessary software and performing count splitting, in this tutorial we carry out the following steps.

**Preprocessing:**We preprocess the training data using the preprocessing steps suggested by the scran tutorial.**Clustering:**We cluster the cells using`scran`

using the preprocessed training data.**Differential Expression Analysis:**Using the test data, we test to see which genes are associated with the estimated clusters. We do this both “by hand” using Poisson GLMs as well as using the built-in methods from the`scran`

package.

If you don’t already have `scran`

, you will need to
run:

```
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("scran")
```

Next, you should load the package, along with others that we will use in this tutorial.

This data is included in the `countsplit_tutorials`

package as a `SingleCellExperiment`

object, so it is simple
to load.

```
data(cm)
cm
```

```
## class: SingleCellExperiment
## dim: 21971 10000
## metadata(0):
## assays(1): counts
## rownames(21971): AL627309.1 AL627309.6 ... AC136352.4 AC007325.4
## rowData names(0):
## colnames(10000): E1_E1CD3col4_CATTTGTGCTTG E1_E1CD3col2_AGAATAAGTCAC
## ... E1_E1CD1col5_GTTACGCTAGTG E1_E1CD2col4_CCGCACAAGATC
## colData names(26): orig.ident nCount_RNA ... pseudotime ident
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
```

The main item in `cm`

that we care about is the counts
matrix, which contains 21,971 genes and 10000 cells. We can view a small
subset of it now.

```
dim(counts(cm))
## [1] 21971 10000
counts(cm)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'E1_E1CD3col4_CATTTGTGCTTG', 'E1_E1CD3col2_AGAATAAGTCAC', 'E2_E2CD2col5_ATGAATGATGAA' ... ]]
##
## AL627309.1 . . . . . . . . . .
## AL627309.6 . . . . . . . . . .
## AL627309.5 . . . . . . . . . .
## AL669831.3 . . . . . . . . . .
## MTND1P23 1 12 . . 3 1 . . 4 .
## MTND2P28 6 3 . . . . 2 1 . .
## MTCO1P12 2 7 2 . 2 . 1 2 3 .
## MTCO2P12 . 1 . . . . . . . .
## MTATP8P1 . . . . . . . . . .
## MTATP6P1 10 8 . . . . 3 . 2 .
```

There is other important information included in the `cm`

object. For example, the cells were collected from 19 individuals over
the course of 15 days. Information on which individual and which day
each cell is from is saved within the `cm`

object as column
data.

```
table(cm$individual)
##
## 18520 18912 19093 18858 18508 18511 18907 18505 19190 18855 19159 18489 18517
## 472 760 925 732 581 701 362 438 524 283 326 400 534
## 18499 18870 19193 19209 19108 19127
## 771 700 373 554 211 353
table(cm$diffday)
##
## day0 day1 day3 day5 day7 day11 day15
## 2383 2389 1691 967 1260 862 448
```

To perform Poisson count splitting we wish to extract only the count matrix.

```
set.seed(1)
X <- counts(cm)
split <- countsplit(X)
```

`## As no overdispersion parameters were provided, Poisson count splitting will be performed.`

```
Xtrain <- split[[1]]
Xtest <- split[[2]]
```

We now wish to compute clusters on the training set. Unlike in our introductory
tuorial, instead of simply running `kmeans()`

on
`log(Xtrain+1)`

, we will use an existing scRNA-seq pipeline
from the `scran`

package that also involves preprocessing
steps such as selecting highly variable genes. In order to do this, we
need to store the training set count matrix `Xtrain`

in a
`SingleCellExperiment`

object.

We could make a new `SingleCellExperiment`

object that
contains only the `Xtrain`

counts matrix. However, this would
discard all of the metadata that was stored in the `cm`

object, such as information regarding which cells where collected on
which days and from which individuals. Thus, we chose to let
`cm.train`

be a copy of `cm`

where we only update
the `counts()`

attribute. This way, all metadata from
`cm`

is retained in `cm.train`

.

```
cm.train <- cm
counts(cm.train) <- Xtrain
```

Now we are ready for our analysis. These steps were inspired by the
scran
tutorial. We first compute sum factors and then we perform log
normalization. Note that the `computeSumFactors()`

function
requires computing initial clusters of cells using
`quickCluster()`

; these are different from the clusters of
cells that we will later analyze for differential expression.

```
clusters <- quickCluster(cm.train)
cm.train <- computeSumFactors(cm.train, clusters=clusters)
cm.train <- logNormCounts(cm.train)
```

We then compute the top 2000 highly variable genes. While this step
does not alter the dimension of `counts(cm.train)`

, it allows
us to later perform clustering or dimension reduction using only these
highly variable genes.

`top.hvgs <- getTopHVGs(modelGeneVar(cm.train), n=2000)`

Finally, we perform dimension reduction on the dataset (using only the highly variable genes).

`cm.train <- fixedPCA(cm.train, subset.row=top.hvgs)`

We cluster the dimension-reduced dataset using scran’s
`clusterCells`

function (which performs graph-based
clustering).

`clusters.train <- clusterCells(cm.train,use.dimred="PCA")`

It turns out that this function returned 11 clusters. We can visualize them below.

```
table(clusters.train)
ggplot(as_tibble(reducedDim(cm.train)), aes(x=PC1, y=PC2, col=as.factor(clusters.train)))+geom_point()+labs(col="Cluster")
```

```
## Don't know how to automatically pick scale for object of type
## <reduced.dim.matrix/matrix>. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type
## <reduced.dim.matrix/matrix>. Defaulting to continuous.
```

We now consider fitting Poisson GLMs to test for differential
expression between clusters 1 and 2. We can do this analysis “by hand”,
meaning that we do not need to store `Xtest`

in a
`SingleCellExperiment`

object.

For computational efficiency, we test 500 randomly selected genes for differential expression, rather than checking all 21,000 genes. The first step is to subset the test data to only include these 500 genes and to only include cells that were placed in cluster 1 or cluster 2. We will include size factors for each cell, estimated on the training set, as offsets in our Poisson GLM. We need to obtain the appropriate size factor vector and subset it to only include the cells that were placed in cluster 1 or cluster 2. We do these steps below.

```
set.seed(1)
indices <- which(clusters.train==1 | clusters.train==2)
genes <- sample(1:NCOL(Xtest), size=500)
Xtestsubset <- Xtest[genes, indices]
sizefactors.subset <- sizeFactors(cm.train)[indices]
```

We are now ready to test for differential expression. We see that 61 out of our 500 genes were identified as significantly differentially expressed at alpha=0.01.

```
results <- t(apply(Xtestsubset, 1, function(u) summary(glm(u~clusters.train[indices], offset=log(sizefactors.subset), family="poisson"))$coefficients[2,]))
table(results[,4] < 0.01)
##
## FALSE TRUE
## 439 61
```

We can view the names of these highly significant genes below.

```
head(results[results[,4] < 0.01,], n=10)
## Estimate Std. Error z value Pr(>|z|)
## C3orf62 1.2324039 0.43690137 2.820783 4.790661e-03
## PPFIA4 1.6967096 0.62676308 2.707099 6.787406e-03
## CDCA7L 0.6235322 0.14797067 4.213890 2.510091e-05
## ZFR 0.2439864 0.08567689 2.847750 4.402946e-03
## SELENOT 0.5564246 0.18215736 3.054637 2.253332e-03
## COL11A2 2.9288532 1.08012328 2.711592 6.696103e-03
## FAM241A 1.6967096 0.62676433 2.707093 6.787516e-03
## PPP2R5A 0.5493071 0.21081767 2.605603 9.171275e-03
## ZNF138 -0.8594601 0.32129460 -2.674991 7.473132e-03
## ZNF436 1.6070974 0.57008420 2.819053 4.816561e-03
```

In this section, instead of testing for differential expression “by
hand” using `glm()`

, we use the `scoreMarkers()`

function from the scran package. This is the method used in the scran
tutorial, in the section titled “Find Marker Genes”.

In order to use the `scoreMarkers()`

function, we need to
store our test matrix in a `SingleCellExperiment`

object. We
can either construct this from scratch using only the count matrix, or
we could make a copy of the original `cm`

object and add the
count matrix after. Since we do not anticipate needing any metadata in
this section, we construct the object from scratch.

`cm.test <- SingleCellExperiment(list(counts=Xtest))`

The `scoreMarkers()`

function needs access to the
log-normalized counts assay (`logcounts()`

) in the
`cm.test`

object. To fill in this assay, we need to run the
`logNormCounts`

function on `cm.test`

. We would
like to normalize the test matrix using size factors computed on the
training set (this follows our general philosophy of performing all
preprocessing steps on the training set). Thus, we obtain size factors
from the training set before calling `logNormCounts()`

.

```
sizeFactors(cm.test) <- sizeFactors(cm.train)
cm.test <- logNormCounts(cm.test)
```

This first element in `results`

shows marker genes that
distinguish cluster 1 from all other clusters.

```
results <- scran::findMarkers(
cm.test, groups= clusters.train,
pval.type = "all")
results[[1]]
```

```
## DataFrame with 21971 rows and 9 columns
## p.value FDR summary.logFC logFC.2 logFC.3
## <numeric> <numeric> <numeric> <numeric> <numeric>
## TPM1 1.08718e-87 2.38865e-83 1.524155 -2.60820 -1.691339
## TNNT2 3.92556e-20 3.42042e-16 0.243453 -2.74309 -0.629304
## MYL7 4.67036e-20 3.42042e-16 -0.619474 -3.00084 -0.619474
## NEXN 6.38871e-17 3.50916e-13 0.180813 -1.26727 -0.599001
## ANKRD1 8.78375e-17 3.85975e-13 0.256620 -1.45896 -1.366888
## ... ... ... ... ... ...
## GABRQ 1 1 0 0.00000000 0.00000000
## CLIC2 1 1 0 0.00000000 -0.00355741
## EEF1A1P41 1 1 0 -0.00491876 0.00000000
## PCDH11Y 1 1 0 0.00000000 0.00000000
## TMSB4Y 1 1 0 0.00000000 0.00000000
## logFC.4 logFC.5 logFC.6 logFC.7
## <numeric> <numeric> <numeric> <numeric>
## TPM1 1.393705 1.539964 1.764288 1.524155
## TNNT2 0.261539 0.284571 0.299269 0.243453
## MYL7 0.511705 0.595673 0.518480 0.422453
## NEXN 0.165924 0.185703 0.184252 0.180813
## ANKRD1 0.300527 0.322272 0.295123 0.256620
## ... ... ... ... ...
## GABRQ -0.004987919 -0.002776015 0.000000000 0.00000000
## CLIC2 -0.000899236 0.000000000 -0.002090804 0.00000000
## EEF1A1P41 -0.000239274 0.000000000 0.000000000 0.00000000
## PCDH11Y -0.001452754 -0.001119289 -0.000813786 0.00000000
## TMSB4Y 0.000000000 -0.000292628 0.000000000 -0.00585973
```

Even with count splitting, we identify several highly significant
marker genes, both “by hand” and with the `findMarkers()`

function.

Elorbany, Reem, Joshua M Popp, Katherine Rhodes, Benjamin J Strober,
Kenneth Barr, Guanghao Qi, Yoav Gilad, and Alexis Battle. 2022.
“Single-Cell Sequencing Reveals Lineage-Specific Dynamic Genetic
Regulation of Gene Expression During Human Cardiomyocyte
Differentiation.” *PLoS Genetics* 18 (1): e1009666.