Differential Count Analysis with a Multinomial Topic Model — diff_count

Implements methods for analysis of differential count analysis using a topic model. These methods are motivated by gene expression studies, but could have other uses, such as identifying “key words” in topics derived from text documents. To improve accuracy of the differential expression analysis, an empirical Bayes method is used to “stabilize” the estimates. A special case of “hard” topic assignments is also implemented---that is, the mixture proportions are all zeros and ones---which involves greatly simplified (and faster) calculations. Use diff_count_clusters for this special case.

diff_count_analysis(
  fit,
  X,
  s = rowSums(X),
  pseudocount = 0.01,
  fit.method = c("em", "optim", "glm"),
  shrink.method = c("ash", "none"),
  numiter = 100,
  tol = 1e-08,
  e = 1e-15,
  show.warning = TRUE,
  verbose = TRUE,
  ...
)

diff_count_clusters(cluster, X, ...)

Arguments

fit	An object of class “poisson_nmf_fit” or “multinom_topic_model_fit”. If a Poisson NMF fit is provided as input, the corresponding multinomial topic model fit is automatically recovered using `poisson2multinom`.
X	The n x m counts matrix. It can be a sparse matrix (class `"dgCMatrix"`) or dense matrix (class `"matrix"`).
s	A numeric vector of length n determining how the rates are scaled in the Poisson model. See “Details” for guidance on the choice of `s`.
pseudocount	Observations with this value are added to the counts matrix to stabilize calculation of the LFC estimates and other statistics.
fit.method	Method used to estimate LFC and compute test statistics. The `"glm"` and `"optim"` computations are particularly slow, and it is recommended to use `fit.method = "em"` for most data sets. See “Details” for more information.
shrink.method	Method used to stabilize the LFC estimates. When `shrink.method = "ash"`, the "adaptive shrinkage" method implemented in the ashr package is used. When `shrink.method = "none"`, no stabilization is performed, and the “raw” LFC estimates are returned.
numiter	Maximum number of iterations performed in optimization of the Poisson model parameters. When `fit.method = "glm"`, this is passed as argument `maxit` to the `glm` function; when `fit.method = "optim"`, this is passed as argument `maxit` to the `optim` function.
tol	Controls the convergence tolerance for the optimization of the Poisson model parameters. When `fit.method = "glm"`, this is passed as argument `epsilon` to function `glm`; when `fit.method = "optim"`, the `factr` optimization setting is set to `tol * .Machine$double.eps`. When `fit.method = "em"`, the EM algorithm will be stopped early when the largest change between two successive updates is less than `tol`.
e	A small, positive scalar included in some computations to avoid logarithms of zero and division by zero.
show.warning	Set `show.warning = FALSE` to suppress a message about calculations when mixture proportions are all 0 or 1.
verbose	When `verbose = TRUE`, progress information is printed to the console.
...	Additional arguments passed to `diff_count_analysis`.
cluster	A factor, or vector that can be converted to a factor such as an integer or character vector, giving an assignment of samples (rows of `X`) to clusters. This could be, for example, the "cluster" output from `kmeans`.

Value

The return value is a list of m x k matrices, where m is the number of columns in the counts matrix, and k is the number of topics (for diff_count_clusters, m is the number of clusters), and an additional vector:

colmeans

A vector of length m containing the count averages (colMeans(X)).

Estimates of the Poisson model parameters $f_0$.

Estimates of the Poisson model parameters $f_1$.

beta

LFC estimates beta = log2(F1/F0).

Standard errors of the Poisson glm parameters $b = f_1 - f_0$.

z-scores for the Poisson glm parameters $b = f_1 - f_0$.

pval

-log10 two-tailed p-values computed from the z-scores. In some extreme cases the calculations may produce zero or negative standard errors, in which case the standard errors are set to NA and the z-scores and -log10 p-values are set to zero.

Details

The methods are based on the following univariate (“single-count”) Poisson model: $$x_i ~ Poisson(s_i \lambda_i),$$ in which the Poisson rates are defined as $$\lambda_i = (1 - q_i) f_0 + q_i f_1,$$ and where the two unknowns to be estimated are $f_0, f_1 > 0.$ This model is applied separately to each count (column of the counts matrix), and to each topic k. The $q_i$'s are the topic probabilities for a given topic. An EM algorithm is used to compute maximum-likelihood estimates (MLEs) of the two unknowns.

The log-fold change (LFC) statistics are defined as the log-ratio of the two Poisson rate parameters, $\beta = \log_2(f_1/f_0)$. This statistic measures the increase (or decrease) in occurance in one topic compared to all other topics. The use of the base-2 logarithm comes from the convention used in gene expression studies.

When each $s_i$ (specified by input argument s) is equal the total count for sample i (this is the default setting in diff_count_analysis), the Poisson model will closely approximate a binomial model of the count data, so that the Poisson model parameters $f_0, f_1$ represent binomial probabilities. In this case, $\beta$ represents the LFC in relative occurrence. (This Poisson approximation to the binomial is most accurate when the total counts rowSums(X) are large and $f_0, f_1$ are small.)

Other choices for s are possible, and implement different normalization schemes for the counts, and different interpretations of the LFC statistics. Setting s to all ones implements the differential count analysis with no normalization, and $\beta$ represents the LFC in absolute occurrence. This choice could be appropriate in settings where the total count is well-controlled across samples (rows of X).

When fit.method = "glm", the LFC estimates and test statistics are implemented by glm with formula = x ~ b0 + b - 1 and family = poisson(link = "identity"), where the unknowns are defined as $b_0 = f_0$ and $b = f_1 - f_0$. The outputted z-scores are p-values are for the b coefficient; that is, they are test statistics for the test that $f_0$ and $f_1$ are not equal.

When fit.method = "optim" or fit.method = "em", maximum-likelihood estimates are computed using optim or using a fast EM algorithm, and test statistics (standard errors, z-scores and p-values) are based on a Laplace approximation to the likelihood at the MLE. These calculations closely reproduce the test statistic calculations in glm.

We recommend setting shrink.method = "ash", which uses the “adaptive shrinkage” method (Stephens, 2016) to improve accuracy of the LFC estimates. The improvement in accuracy is greatest for genes with low expression, or when the sample size is small (or both). We follow the settings used in lfcShrink from the DESeq2 package, with type = "ashr".

References

Stephens, M. (2016). False discovery rates: a new deal. Biostatistics 18(2), kxw041. https://doi.org/10.1093/biostatistics/kxw041

Zhu, A., Ibrahim, J. G. and Love, M. I. (2019). Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics 35(12), 2084–2092.