This vignette continues from Part 1, where we introduced the basic steps of a topic modeling analysis for single-cell data. Here we give more detailed guidance and discuss complications that may arise.

We begin our analysis by loading the packages. Then we set the seed so that the results can be reproduced.

library(Matrix)
library(fastTopics)
library(ggplot2)
library(cowplot)
set.seed(1)

Now we load the data set and the $$K = 6$$ pre-fitted topic model.

data(pbmc_facs)
counts <- pbmc_facs$counts fit <- pbmc_facs$fit

## Assessing convergence of model fitting

The topic model was fitted using the fit_topic_model function. This function, as we mentioned, hides most of the complexities of model fitting. Nevertheless, it is a good idea to check that the model fitting has converged to a maximum-likelihood solution. This is easily done with ‘fastTopics’:

plot_progress(fit,x = "iter",add.point.every = 10,colors = "black") +
theme_cowplot(font_size = 10)

Internally, fit_topic_model first fits a non-negative matrix factorization (NMF) to the count data, then converts the fitted NMF to a topic model, so the plot shows the improvement in the model fit over time. Judging by this plot, the parameter estimates get close to a maxium-likelihood solution after about 150 updates.

For larger or more complex data sets, more updates may be needed to obtain a good fit. For guidance, see help(fit_topic_model) and help(fit_poisson_nmf).

## Evaluating fit: single-cell likelihoods

The topic model can be used to calculate a likelihood for each cell:

loglik <- loglik_multinom_topic_model(counts,fit)

This can be used to assess how well the topic model “fits” each cell.

pdat <- data.frame(loglik)
ggplot(pdat,aes(loglik)) +
geom_histogram(bins = 64,color = "white",fill = "black",size = 0.25) +
labs(y = "number of cells") +
theme_cowplot(font_size = 10)

Most of the poorly fit cells are in the CD34+ subpopulation:

subpop_colors <- c("dodgerblue","forestgreen","darkmagenta","skyblue","gold")
pdat <- data.frame(loglik = loglik,subpop = pbmc_facs$samples$subpop)
ggplot(pdat,aes(x = loglik,fill = subpop)) +
geom_histogram(bins = 64,color = "white",size = 0.25) +
scale_fill_manual(values = subpop_colors) +
labs(y = "number of cells") +
theme_cowplot(font_size = 10)

## Visualizing the topics without cell labels

In [Part 1][single_cell_rnaseq_practical], we made use of additional information about the cells to help us interpret the topics. Even without the cell labels, the Structure plot can still be effective:

topic_colors <- c("skyblue","forestgreen","darkmagenta","dodgerblue",
"gold","darkorange")
structure_plot(fit,colors = topic_colors)

From this Structure plot, the topics clearly distinguish the B cells (dark blue), CD14+ monocytes (green) and CD34+ cells (purple). The NK cells (light blue) and T cells (cells high proportions of yellow and orange) are harder to distinguish, and this is reflected in sharing of topics (mostly topic 6) among NK and T cells.

Of course, without the cell labels, we cannot know that the topics correspond to these cell types without further downstream analyses—for this, we performed a GoM DE analysis (see Part 1).

## Identifying clusters from the topic proportions

In more complex data sets, some structure may not show up well without additional fine-tuning of the plot.

One simple strategy that often works well is to first identify clusters in the topic proportions matrix, $$L$$. Here we use $$k$$-means to identify clusters, but other methods could be used.

set.seed(1)
pca <- prcomp(fit$L)$x
clusters <- kmeans(pca,centers = 6,iter.max = 100)$cluster summary(factor(clusters)) # 1 2 3 4 5 6 # 400 207 1047 797 616 707 We chose 6 clusters, but to be clear the best number of clusters may not be the same as the number of topics. Now we create another Structure plot in which the cells are grouped according to the clusters: structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25, grouping = clusters) + theme(axis.text.x = element_blank()) k-means somewhat arbitrarily split the T cells (cells with high proportions of topics 5 and 6) into clusters 1 and 3. Therefore, we merge these two clusters: clusters[clusters == 3] <- 1 clusters <- factor(clusters) We have found that visual inspection of the clusters followed by manual refinement is often be needed to get the “right” clustering. Now we can label the clusters by these cell types in the plot. (Noting that this labeling is usually not possible until downstream analyses have been performed.) levels(clusters) <- c("T","CD14+","B","CD34+","NK") structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25, grouping = clusters) It is also sometimes helpful to visualize the structure in PCA plots which show the projection of the cells onto PCs of the topic proportions: cluster_colors <- c("gold","forestgreen","dodgerblue","darkmagenta","skyblue") pca_plot(fit,fill = clusters) + scale_fill_manual(values = cluster_colors) When there are many overlapping points, plotting the density can sometimes show the clusters more clearly: pca_hexbin_plot(fit,bins = 24) ## More on differential expression analysis In Part 1, we performed a DE analysis using the topic model. de <- pbmc_facs$de

When the volcano plot shows many overlapping differentially expressed genes, it can be helpful to explore the results interactively. The function volcano_plotly creates an interactive volcano plot that can be viewed in a Web browser. For example, here is an interactive volcano plot for the T cells topic:

genes <- pbmc_facs$genes p <- volcano_plotly(de,k = 6,file = "volcano_plot_t_cells.html", labels = genes$symbol,ymax = 100)

This call creates a new file volcano_plot_t_cells.html. The interactive volcano plot can also be viewed here, or by calling print(p) in R.

## Session info

This is the version of R and the packages that were used to generate these results.

sessionInfo()
# R version 3.6.2 (2019-12-12)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS Catalina 10.15.7
#
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base
#
# other attached packages:
# [1] cowplot_1.1.1      ggplot2_3.3.6      fastTopics_0.6-150 Matrix_1.2-18
#
# loaded via a namespace (and not attached):
#  [1] mcmc_0.9-6         fs_1.5.2           progress_1.2.2     httr_1.4.2
#  [5] rprojroot_1.3-2    tools_3.6.2        backports_1.1.5    bslib_0.3.1
#  [9] utf8_1.1.4         R6_2.4.1           irlba_2.3.3        uwot_0.1.10
# [13] DBI_1.1.0          lazyeval_0.2.2     colorspace_1.4-1   withr_2.5.0
# [17] tidyselect_1.1.1   prettyunits_1.1.1  compiler_3.6.2     cli_3.5.0
# [21] quantreg_5.54      SparseM_1.78       desc_1.2.0         plotly_4.10.1
# [25] labeling_0.3       sass_0.4.0         scales_1.1.0       SQUAREM_2017.10-1
# [29] hexbin_1.28.0      quadprog_1.5-8     pbapply_1.5-1      pkgdown_2.0.2
# [33] mixsqp_0.3-46      systemfonts_1.0.2  stringr_1.4.0      digest_0.6.23
# [37] rmarkdown_2.11     MCMCpack_1.4-5     pkgconfig_2.0.3    htmltools_0.5.4
# [41] fastmap_1.1.0      invgamma_1.1       highr_0.8          htmlwidgets_1.6.1
# [45] rlang_1.0.6        shiny_1.7.4        jquerylib_0.1.4    generics_0.0.2
# [49] farver_2.0.1       jsonlite_1.7.2     crosstalk_1.0.0    dplyr_1.0.7
# [53] magrittr_2.0.1     Rcpp_1.0.8         munsell_0.5.0      fansi_0.4.0
# [57] lifecycle_1.0.3    stringi_1.4.3      yaml_2.2.0         MASS_7.3-51.4
# [61] Rtsne_0.15         grid_3.6.2         promises_1.1.0     parallel_3.6.2
# [65] ggrepel_0.9.1      crayon_1.4.1       lattice_0.20-38    hms_1.1.0
# [69] knitr_1.37         pillar_1.6.2       glue_1.4.2         evaluate_0.14
# [73] data.table_1.12.8  RcppParallel_5.1.5 httpuv_1.5.2       vctrs_0.3.8
# [77] MatrixModels_0.4-1 gtable_0.3.0       purrr_0.3.4        tidyr_1.1.3
# [81] assertthat_0.2.1   ashr_2.2-54        xfun_0.29          mime_0.8
# [85] xtable_1.8-4       later_1.0.0        coda_0.19-3        ragg_0.3.1
# [89] viridisLite_0.3.0  truncnorm_1.0-8    tibble_3.1.3       memoise_1.1.0
# [93] ellipsis_0.3.2