`vignettes/single_cell_rnaseq_practical.Rmd`

`single_cell_rnaseq_practical.Rmd`

This vignette continues from Part 1, where we introduced the basic steps of a topic modeling analysis for single-cell data. Here we give more detailed guidance and discuss complications that may arise.

We begin our analysis by loading the packages. Then we set the seed so that the results can be reproduced.

Now we load the data set and the \(K = 6\) pre-fitted topic model.

```
data(pbmc_facs)
counts <- pbmc_facs$counts
fit <- pbmc_facs$fit
```

The topic model was fitted using the `fit_topic_model`

function. This function, as we mentioned, hides most of the complexities
of model fitting. Nevertheless, it is a good idea to check that the
model fitting has converged to a maximum-likelihood solution. This is
easily done with ‘fastTopics’:

```
plot_progress(fit,x = "iter",add.point.every = 10,colors = "black") +
theme_cowplot(font_size = 10)
```

Internally, `fit_topic_model`

first fits a non-negative
matrix factorization (NMF) to the count data, then converts the fitted
NMF to a topic model, so the plot shows the improvement in the model fit
over time. Judging by this plot, the parameter estimates get close to a
maxium-likelihood solution after about 150 updates.

For larger or more complex data sets, more updates may be needed to
obtain a good fit. For guidance, see `help(fit_topic_model)`

and `help(fit_poisson_nmf)`

.

The topic model can be used to calculate *a likelihood for each
cell:*

`loglik <- loglik_multinom_topic_model(counts,fit)`

This can be used to assess how well the topic model “fits” each cell.

```
pdat <- data.frame(loglik)
ggplot(pdat,aes(loglik)) +
geom_histogram(bins = 64,color = "white",fill = "black",size = 0.25) +
labs(y = "number of cells") +
theme_cowplot(font_size = 10)
```

Most of the poorly fit cells are in the CD34+ subpopulation:

```
subpop_colors <- c("dodgerblue","forestgreen","darkmagenta","skyblue","gold")
pdat <- data.frame(loglik = loglik,subpop = pbmc_facs$samples$subpop)
ggplot(pdat,aes(x = loglik,fill = subpop)) +
geom_histogram(bins = 64,color = "white",size = 0.25) +
scale_fill_manual(values = subpop_colors) +
labs(y = "number of cells") +
theme_cowplot(font_size = 10)
```

In [Part 1][single_cell_rnaseq_practical], we made use of additional information about the cells to help us interpret the topics. Even without the cell labels, the Structure plot can still be effective:

```
topic_colors <- c("skyblue","forestgreen","darkmagenta","dodgerblue",
"gold","darkorange")
structure_plot(fit,colors = topic_colors)
```

From this Structure plot, the topics clearly distinguish the B cells (dark blue), CD14+ monocytes (green) and CD34+ cells (purple). The NK cells (light blue) and T cells (cells high proportions of yellow and orange) are harder to distinguish, and this is reflected in sharing of topics (mostly topic 6) among NK and T cells.

Of course, without the cell labels, we cannot know that the topics correspond to these cell types without further downstream analyses—for this, we performed a GoM DE analysis (see Part 1).

In more complex data sets, some structure may not show up well without additional fine-tuning of the plot.

One simple strategy that often works well is to first identify clusters in the topic proportions matrix, \(L\). Here we use \(k\)-means to identify clusters, but other methods could be used.

```
set.seed(1)
pca <- prcomp(fit$L)$x
clusters <- kmeans(pca,centers = 6,iter.max = 100)$cluster
summary(factor(clusters))
# 1 2 3 4 5 6
# 400 207 1047 797 616 707
```

We chose 6 clusters, but to be clear the best number of clusters may not be the same as the number of topics.

Now we create another Structure plot in which the cells are grouped according to the clusters:

```
structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25,
grouping = clusters) +
theme(axis.text.x = element_blank())
```

k-means somewhat arbitrarily split the T cells (cells with high proportions of topics 5 and 6) into clusters 1 and 3. Therefore, we merge these two clusters:

```
clusters[clusters == 3] <- 1
clusters <- factor(clusters)
```

We have found that visual inspection of the clusters followed by manual refinement is often be needed to get the “right” clustering.

Now we can label the clusters by these cell types in the plot. (Noting that this labeling is usually not possible until downstream analyses have been performed.)

```
levels(clusters) <- c("T","CD14+","B","CD34+","NK")
structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25,
grouping = clusters)
```

It is also sometimes helpful to visualize the structure in PCA plots which show the projection of the cells onto PCs of the topic proportions:

```
cluster_colors <- c("gold","forestgreen","dodgerblue","darkmagenta","skyblue")
pca_plot(fit,fill = clusters) +
scale_fill_manual(values = cluster_colors)
```

When there are many overlapping points, plotting the *density*
can sometimes show the clusters more clearly:

`pca_hexbin_plot(fit,bins = 24)`

In Part 1, we performed a DE analysis using the topic model.

`de <- pbmc_facs$de`

When the volcano plot shows many overlapping differentially expressed
genes, it can be helpful to explore the results interactively. The
function `volcano_plotly`

creates an *interactive volcano
plot* that can be viewed in a Web browser. For example, here is an
interactive volcano plot for the T cells topic:

```
genes <- pbmc_facs$genes
p <- volcano_plotly(de,k = 6,file = "volcano_plot_t_cells.html",
labels = genes$symbol,ymax = 100)
```

This call creates a new file `volcano_plot_t_cells.html`

.
The interactive volcano plot can also be viewed here,
or by calling `print(p)`

in R.

This is the version of R and the packages that were used to generate these results.

```
sessionInfo()
# R version 3.6.2 (2019-12-12)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS Catalina 10.15.7
#
# Matrix products: default
# BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] cowplot_1.1.1 ggplot2_3.3.6 fastTopics_0.6-150 Matrix_1.2-18
#
# loaded via a namespace (and not attached):
# [1] mcmc_0.9-6 fs_1.5.2 progress_1.2.2 httr_1.4.2
# [5] rprojroot_1.3-2 tools_3.6.2 backports_1.1.5 bslib_0.3.1
# [9] utf8_1.1.4 R6_2.4.1 irlba_2.3.3 uwot_0.1.10
# [13] DBI_1.1.0 lazyeval_0.2.2 colorspace_1.4-1 withr_2.5.0
# [17] tidyselect_1.1.1 prettyunits_1.1.1 compiler_3.6.2 cli_3.5.0
# [21] quantreg_5.54 SparseM_1.78 desc_1.2.0 plotly_4.10.1
# [25] labeling_0.3 sass_0.4.0 scales_1.1.0 SQUAREM_2017.10-1
# [29] hexbin_1.28.0 quadprog_1.5-8 pbapply_1.5-1 pkgdown_2.0.2
# [33] mixsqp_0.3-46 systemfonts_1.0.2 stringr_1.4.0 digest_0.6.23
# [37] rmarkdown_2.11 MCMCpack_1.4-5 pkgconfig_2.0.3 htmltools_0.5.4
# [41] fastmap_1.1.0 invgamma_1.1 highr_0.8 htmlwidgets_1.6.1
# [45] rlang_1.0.6 shiny_1.7.4 jquerylib_0.1.4 generics_0.0.2
# [49] farver_2.0.1 jsonlite_1.7.2 crosstalk_1.0.0 dplyr_1.0.7
# [53] magrittr_2.0.1 Rcpp_1.0.8 munsell_0.5.0 fansi_0.4.0
# [57] lifecycle_1.0.3 stringi_1.4.3 yaml_2.2.0 MASS_7.3-51.4
# [61] Rtsne_0.15 grid_3.6.2 promises_1.1.0 parallel_3.6.2
# [65] ggrepel_0.9.1 crayon_1.4.1 lattice_0.20-38 hms_1.1.0
# [69] knitr_1.37 pillar_1.6.2 glue_1.4.2 evaluate_0.14
# [73] data.table_1.12.8 RcppParallel_5.1.5 httpuv_1.5.2 vctrs_0.3.8
# [77] MatrixModels_0.4-1 gtable_0.3.0 purrr_0.3.4 tidyr_1.1.3
# [81] assertthat_0.2.1 ashr_2.2-54 xfun_0.29 mime_0.8
# [85] xtable_1.8-4 later_1.0.0 coda_0.19-3 ragg_0.3.1
# [89] viridisLite_0.3.0 truncnorm_1.0-8 tibble_3.1.3 memoise_1.1.0
# [93] ellipsis_0.3.2
```