`vignettes/single_cell_rnaseq_basic.Rmd`

`single_cell_rnaseq_basic.Rmd`

The aim of this vignette is to introduce the basic concepts behind an analysis of single-cell RNA-seq data using a topic model, and to show how to use **fastTopics** to implement a topic model analysis. We introduce the basic concepts and **fastTopics** interface through a simple example. This first vignette is only intended to explain the topic model analysis at a high level—see Part 2 for additional explanations and guidance.

Since a topic model analysis is quite different from conventional analyses of single-cell RNA-seq data, in this vignette we carefully point out key differences and clarify possible misconceptions. One key difference is that a topic model is a model of *count data*, so *the topic model is intended to be applied directly to the count data* that arises from the RNA sequencing assay. This is in contrast with most methods that require careful pre-processing of the count data.

We begin our analysis by loading the packages. Then we set the seed so that the results can be reproduced.

We will illustrate the concepts using a single-cell RNA-seq data set from Zheng *et al* (2017). These data are reference transcriptome profiles from 10 bead-enriched subpopulations of peripheral blood mononuclear cells (PBMCs). The original data set is much larger—for this introduction, we have taken a small subset of approximately 3,700 cells.

The data we will analyze are the unique molecular identifier (UMI) counts. These data are stored as an \(n \times m\) sparse matrix, where \(n\) is the number of cells and \(m\) is the number of genes:

The UMI count data are expected to be “sparse”—that is, most of the counts are expected to be zero. Indeed, over 95% of the UMI counts are zero:

mean(counts > 0) # [1] 0.04265257

Most analyses single-cell RNA-seq data involve a pre-processing step in which the UMI counts are log-transformed and normalized. *When analyzing UMI counts with a topic model, you should not normalize or transform the counts.* In particular:

There is no need to log-transform the UMI counts because the counts are modeled directly using the multinomial distribution.

There is no need to normalize the counts in each cell by sequencing coverage because the multinomial is

*conditioned*on the cell’s total UMI count.

Additionally, analyses of single-cell RNA-seq data typically select only the most highly variable genes. *We recommend instead using all genes.* The only exception are genes in which all the UMI counts are zero—these genes should be removed prior to a topic model analysis.

Since no pre-processing is needed, we can move directly to the next step of the analysis: fitting the topic model to the UMI count data. This is accomplished with a call to `fit_topic_model`

:

fit <- fit_topic_model(counts,k = 6)

This is typically the most computationally burdensome step—it may take several minutes to run `fit_topic_model`

on this data set. For convenience, we saved the output from this call:

fit <- pbmc_facs$fit

To fit a topic model, we must specify \(K\), the number of topics. Here, we have chosen \(K = 6\) topics. In most settings, a good choice of \(K\) will not be known in advance, so you will you want to explore the results from topic models at different settings of \(K\).

There are several complexities to topic model fitting—we do not elaborate on these complexities here. The `fit_topic_model`

interface is intended to hide most of these complexities, and it should work well for a wide range of data sets. However, larger or more challenging data sets my require some fine-tuning of the model fitting. See Part 2 for more on this.

A key feature of the topic model is that each cell \(i\) is represented as a *unique mixture of the topics.* Therefore, each cell can be summarized from the \(K\) mixture proportions. This cell-specific mixture is learned from the data. In **fastTopics**, the mixture proportions for all cells are stored as an \(n \times K\) matrix:

dim(fit$L) # [1] 3774 6

To illustrate, here is a cell in which the pattern of expression is almost fully captured by the fourth topic:

rows <- "GATATATGTCAGTG-1-b_cells" round(fit$L[rows,],digits = 3) # k1 k2 k3 k4 k5 k6 # 0.000 0.000 0.000 0.985 0.015 0.000

Here are two more examples:

rows <- c("GACAGTACCTGTGA-1-memory_t", "TGAAGCACACAGCT-1-b_cells") round(fit$L[rows,],digits = 3) # k1 k2 k3 k4 k5 k6 # GACAGTACCTGTGA-1-memory_t 0 0.002 0.002 0.000 0.087 0.909 # TGAAGCACACAGCT-1-b_cells 0 0.000 0.058 0.657 0.000 0.285

The third example is interesting because the observed expression is best captured by a *mixture* of topics 4 and 6.

To make sense of these results, we first need to understand the biological relevance of the topics. We turn to this question next.

In some cases, you may have additional information about the cells, such as the tissue the cells were sampled from. (For tips on exploring the results when cell labels are not available, see Part 2.) In the PBMC data, the cells were labeled using fluorescence-activated cell sorting (FACS). Each cell is assigned one of five labels (each corresponding to a cell type): B cells, CD14+ monocytes, CD34+ cells, natural killer (NK) cells, and T cells.

samples <- pbmc_facs$samples summary(samples$subpop) # B cell CD14+ CD34+ NK cell T cell # 767 163 687 673 1484

To visualize the relationship between the cell labels and the mixture proportions, we create a Structure plot:

topic_colors <- c("skyblue","forestgreen","darkmagenta","dodgerblue", "gold","darkorange") structure_plot(fit,colors = topic_colors,topics = 1:6,gap = 25, grouping = samples$subpop)

The Structure plot is a simply stacked bar chart, in which each topic is represented as a bar of a different colour. Being *proportions*, the mixture proportions for each cell must sum to 1, so the total height of the bars is the same for all cells, which makes it easier to compare across cells. Patterns begin to emerge when the cells are arranged so that cells with similar mixture proportions are positioned close to each other. This arrangement is automated in `structure_plot`

.

From this Structure plot, it is evident that topics 1 (lighter blue), 2 (green) and 4 (darker blue) closely correspond to the NK, CD14+ and B cell types, respectively; the expression in these cells is largely explained by a single topic. NK cells show more heterogeneity in expression than the CD14+ and B cells.

The mixture proportions uncover some surprises in the cells labeled as “CD34+”: while their expression is largely explained by topic 3 (purple), some cells appear to be mislabeled—they should have been labeled as B cells or CD14+ cells. Other cells are best fit by a mixture of several topics.

Finally, the topic model captures interesting substructure within the T cells. Two patterns stand out from the Structure plot: first, most T cells are a mixture of two topics, 5 (yellow) and 6 (orange), with wide variation in the mixing proportions for these two topics; second, there is a distinctive subset of T cells that is represented as a mixture of three topics (topics 1, 5 and 6). To help us understand what aspects of T cell diversity are captured by these topics, we need to first understand how topics capture shared patterns of gene expression.

Having fitted a topic model to single-cell data, it is easy to use that model to predict the topic proportions for other cells that were not used to fit the model. We illustrate this with a separate collection of 1,000 cells drawn from the same PBMC data set:

counts_test <- pbmc_facs$counts_test dim(counts_test) # [1] 1000 16791

The “predict” function estimates the topic proportions based on a previously fit topic model.

fit_test <- fit fit_test$s <- NULL fit_test$L <- predict(fit,counts_test)

The predictions appear to align well with the provided cell labels:

structure_plot(fit_test,topics = 1:6,colors = topic_colors,gap = 10, grouping = pbmc_facs$samples_test$subpop)

Above, we explained that the topic model represents cells as mixtures of topics. In our running example, some of the topics seem to have a straightforward interpretation as known cell types. More precisely, each topic captures an expression pattern, encoded as a vector of \(m\) relative expression levels (\(m\) is the number of genes). In **fastTopics**, the relative expression levels for all topics are stored as an \(m \times K\) matrix:

dim(fit$F) # [1] 16791 6

Because the expression patterns are encoded by *relative expression levels*, the estimates of relative expression are directly comparable across topics. For example, gene *CD79B* is an important component of B cell differentiation, so we would expect higher expression in B cells. Indeed, this is what we find in the topic 4 (we multiplied the estimates by \(10^6\) only to make them easier to read):

genes <- pbmc_facs$genes j <- which(genes$symbol == "CD79B") round(1e6 * fit$F[j,]) # k1 k2 k3 k4 k5 k6 # 40 0 0 2135 199 0

Expression of *CD79A*—a known marker gene for B cells—provides an even stronger clue, as it is uniquely expressed in topic 4:

By contrast, genes that are highly expressed in all topics, e.g., *COX4I1*, provide little useful information:

j <- which(genes$symbol == "COX4I1") round(1e6 * fit$F[j,]) # k1 k2 k3 k4 k5 k6 # 1231 3535 1444 1759 1158 2104

The larger point here is that the most informative genes are the genes with higher expression in one topic compared to the other topics (and more generally, genes with large differences in gene expression). This idea of analyzing differences in gene expression analysis is an old idea. The next section explains how differential expression analysis is performed using the topic model.

To set the stage for differential expression (DE) analysis using a topic model, we begin with a classic DE analysis using the available cell-type labels. This analysis centers on calculation of the *log-fold change* (LFC) statistic. For a given group or cluster, the LFC is defined as the (base-2) log-ratio of two expectations: the mean expression (*i.e.*, mean UMI count) for all cells belonging to the specifed group over the mean expression among cells not belonging to the group. For example, genes *CD79A* and *CD79B* are overexpressed in B cells, by a multiplicative factor of \(2^{5.66} \approx 50\) and \(2^{4.9} \approx 30\), respectively, and they are underexpressed in the other four cell types:

dfc_out <- diff_count_clusters(samples$subpop,counts) rbind(dfc_out$beta[genes$symbol == "CD79A",], dfc_out$beta[genes$symbol == "CD79B",]) # Fitting 16791 x 5 = 83955 univariate Poisson models. # Computing log-fold change statistics. # Stabilizing log-fold change estimates using adaptive shrinkage. # B cell CD14+ CD34+ NK cell T cell # [1,] 5.659506 -10.438944 -2.815194 -7.180773 -4.596314 # [2,] 4.936403 -4.131096 -2.611642 -2.859923 -3.454288

**fastTopics** extends the classic DE analysis to allow for *partial membership* to groups. This generalized DE analysis is implemented in `diff_count_analysis`

:

dfa_out <- diff_count_analysis(fit,counts) # Fitting 16791 x 6 = 100746 univariate Poisson models. # Computing log-fold change statistics. # Stabilizing log-fold change estimates using adaptive shrinkage.

As in the standard analysis, it outputs LFC estimates:

rbind(dfa_out$beta[genes$symbol == "CD79A",], dfa_out$beta[genes$symbol == "CD79B",]) # k1 k2 k3 k4 k5 k6 # [1,] -10.986622 -10.651502 -11.04399 17.21208 -9.930242 -10.82830 # [2,] -9.223571 -9.678166 -10.49940 6.37068 -9.433202 -10.64272

These LFC statistics have the same interpretation (the standard LFC emerges as a special case when the all mixture proportions are 0 or 1). The LFC estimates are noticeably larger in the B cells topic (topic 4), particularly so for *CD79A*. In fact, *CD79A* is almost uniquely expressed in topic 4.

Up to this point, we have illustrated the DE analysis with known genes such as *CD79A*. Ultimately, we would like to *discover* genes relevant to a topic (of known or unknown biological relevance).

To discover genes, a natural thing to do would be to rank candidates by their LFC. The complication is that lowly expressed genes could also produce, by chance, large LFCs, and we would not want to rank these genes highly. The *volcano plot* balances these two concerns: the LFC is shown on the x-axis, and the *p*-value or *z*-score, quantifying support for differential expression, is shown on the *y*-axis. Here is volcano plot for topic 4:

volcano_plot(dfa_out,k = 4,label_above_quantile = 0.995, labels = genes$symbol)

Typically, the most interesting genes are found in the top-right portion of the volcano plot—that is, genes with large LFC and strong support (small *p*-value or high-magnitude *z*-score). Indeed, B-cell genes, among them *CD79A* and *CD79B*, appear near the top-right corner of the volcano plot.

(Note that the x-axis in the volcano plot ranges between -15 and 15; by default, LFC estimates outside this range are projected onto this range. You can use the `betamax`

argument to change how the LFC estimates are shown in the volcano plot.)

Likewise, natural killer genes such as *NKG7* emerge at the top of the volcano plot for topic 1:

volcano_plot(dfa_out,k = 1,label_above_quantile = 0.995, labels = genes$symbol)

T-cell-specific expression patterns are captured by topic 6 (e.g., genes *CD3D*, *CD3E*):

volcano_plot(dfa_out,k = 6,label_above_quantile = 0.9925, labels = genes$symbol)

See also Part 2 for more DE analysis examples.

This is the version of R and the packages that were used to generate these results.

sessionInfo() # R version 3.6.2 (2019-12-12) # Platform: x86_64-apple-darwin15.6.0 (64-bit) # Running under: macOS Catalina 10.15.7 # # Matrix products: default # BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib # LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib # # locale: # [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 # # attached base packages: # [1] stats graphics grDevices utils datasets methods base # # other attached packages: # [1] cowplot_1.0.0 ggplot2_3.3.0 fastTopics_0.5-52 Matrix_1.2-18 # # loaded via a namespace (and not attached): # [1] ggrepel_0.9.0 Rcpp_1.0.5 invgamma_1.1 lattice_0.20-38 # [5] tidyr_1.0.0 prettyunits_1.1.1 assertthat_0.2.1 zeallot_0.1.0 # [9] rprojroot_1.3-2 digest_0.6.23 truncnorm_1.0-8 R6_2.4.1 # [13] backports_1.1.5 MatrixModels_0.4-1 evaluate_0.14 coda_0.19-3 # [17] httr_1.4.2 pillar_1.4.3 progress_1.2.2 rlang_0.4.5 # [21] lazyeval_0.2.2 data.table_1.12.8 irlba_2.3.3 SparseM_1.78 # [25] rmarkdown_2.3 pkgdown_1.5.1 labeling_0.3 desc_1.2.0 # [29] Rtsne_0.15 stringr_1.4.0 htmlwidgets_1.5.1 uwot_0.1.10 # [33] munsell_0.5.0 mixsqp_0.3-46 compiler_3.6.2 xfun_0.11 # [37] pkgconfig_2.0.3 SQUAREM_2017.10-1 mcmc_0.9-6 htmltools_0.4.0 # [41] tidyselect_0.2.5 tibble_2.1.3 quadprog_1.5-8 viridisLite_0.3.0 # [45] withr_2.1.2 crayon_1.3.4 dplyr_0.8.3 MASS_7.3-51.4 # [49] grid_3.6.2 jsonlite_1.6 gtable_0.3.0 lifecycle_0.1.0 # [53] magrittr_1.5 scales_1.1.0 RcppParallel_4.4.2 stringi_1.4.3 # [57] farver_2.0.1 fs_1.3.1 vctrs_0.2.1 tools_3.6.2 # [61] glue_1.3.1 purrr_0.3.3 hms_0.5.2 yaml_2.2.0 # [65] colorspace_1.4-1 ashr_2.2-51 memoise_1.1.0 plotly_4.9.2 # [69] knitr_1.26 quantreg_5.54 MCMCpack_1.4-5