R/diffcount.R
diff_count_analysis.Rd
Implements methods for analysis of differential count
analysis using a topic model. These methods are motivated by gene
expression studies, but could have other uses, such as identifying
“key words” in topics derived from text documents. To
improve accuracy of the differential expression analysis, an
empirical Bayes method is used to “stabilize” the
estimates. A special case of “hard” topic assignments is
also implemented---that is, the mixture proportions are all zeros
and ones---which involves greatly simplified (and faster)
calculations. Use diff_count_clusters
for this special case.
diff_count_analysis( fit, X, s = rowSums(X), pseudocount = 0.01, fit.method = c("em", "optim", "glm"), shrink.method = c("ash", "none"), numiter = 100, tol = 1e-08, e = 1e-15, show.warning = TRUE, verbose = TRUE, ... ) diff_count_clusters(cluster, X, ...)
fit | An object of class “poisson_nmf_fit” or
“multinom_topic_model_fit”. If a Poisson NMF fit is provided
as input, the corresponding multinomial topic model fit is
automatically recovered using |
---|---|
X | The n x m counts matrix. It can be a sparse matrix (class
|
s | A numeric vector of length n determining how the rates are
scaled in the Poisson model. See “Details” for guidance on
the choice of |
pseudocount | Observations with this value are added to the counts matrix to stabilize calculation of the LFC estimates and other statistics. |
fit.method | Method used to estimate LFC and compute test
statistics. The |
shrink.method | Method used to stabilize the LFC estimates.
When |
numiter | Maximum number of iterations performed in
optimization of the Poisson model parameters. When
|
tol | Controls the convergence tolerance for the optimization
of the Poisson model parameters. When |
e | A small, positive scalar included in some computations to avoid logarithms of zero and division by zero. |
show.warning | Set |
verbose | When |
... | Additional arguments passed to
|
cluster | A factor, or vector that can be converted to a
factor such as an integer or character vector, giving an assignment
of samples (rows of |
The return value is a list of m x k matrices, where m is
the number of columns in the counts matrix, and k is the number of
topics (for diff_count_clusters
, m is the number of
clusters), and an additional vector:
A vector of length m containing the count averages
(colMeans(X)
).
Estimates of the Poisson model parameters \(f_0\).
Estimates of the Poisson model parameters \(f_1\).
LFC estimates beta = log2(F1/F0)
.
Standard errors of the Poisson glm parameters \(b = f_1 - f_0\).
z-scores for the Poisson glm parameters \(b = f_1 - f_0\).
-log10 two-tailed p-values computed from the
z-scores. In some extreme cases the calculations may produce zero
or negative standard errors, in which case the standard errors are
set to NA
and the z-scores and -log10 p-values are set to
zero.
The methods are based on the following univariate (“single-count”) Poisson model: $$x_i ~ Poisson(s_i \lambda_i),$$ in which the Poisson rates are defined as $$\lambda_i = (1 - q_i) f_0 + q_i f_1,$$ and where the two unknowns to be estimated are \(f_0, f_1 > 0.\) This model is applied separately to each count (column of the counts matrix), and to each topic k. The \(q_i\)'s are the topic probabilities for a given topic. An EM algorithm is used to compute maximum-likelihood estimates (MLEs) of the two unknowns.
The log-fold change (LFC) statistics are defined as the log-ratio of the two Poisson rate parameters, \(\beta = \log_2(f_1/f_0)\). This statistic measures the increase (or decrease) in occurance in one topic compared to all other topics. The use of the base-2 logarithm comes from the convention used in gene expression studies.
When each \(s_i\) (specified by input argument s
) is equal
the total count for sample i (this is the default setting in
diff_count_analysis
), the Poisson model will closely
approximate a binomial model of the count data, so that the Poisson
model parameters \(f_0, f_1\) represent binomial probabilities.
In this case, \(\beta\) represents the LFC in relative
occurrence. (This Poisson approximation to the binomial is most
accurate when the total counts rowSums(X)
are large and
\(f_0, f_1\) are small.)
Other choices for s
are possible, and implement different
normalization schemes for the counts, and different interpretations
of the LFC statistics. Setting s
to all ones implements the
differential count analysis with no normalization, and \(\beta\)
represents the LFC in absolute occurrence. This choice could
be appropriate in settings where the total count is well-controlled
across samples (rows of X
).
When fit.method = "glm"
, the LFC estimates and test
statistics are implemented by glm
with
formula = x ~ b0 + b - 1
and family = poisson(link =
"identity")
, where the unknowns are defined as \(b_0 = f_0\) and
\(b = f_1 - f_0\). The outputted z-scores are p-values are for
the b
coefficient; that is, they are test statistics for the
test that \(f_0\) and \(f_1\) are not equal.
When fit.method = "optim"
or fit.method = "em"
,
maximum-likelihood estimates are computed using
optim
or using a fast EM algorithm, and test
statistics (standard errors, z-scores and p-values) are based on a
Laplace approximation to the likelihood at the MLE. These
calculations closely reproduce the test statistic calculations in
glm
.
We recommend setting shrink.method = "ash"
, which uses the
“adaptive shrinkage” method (Stephens, 2016) to improve
accuracy of the LFC estimates. The improvement in accuracy is
greatest for genes with low expression, or when the sample size is
small (or both). We follow the settings used in lfcShrink
from the DESeq2 package, with type = "ashr"
.
Stephens, M. (2016). False discovery rates: a new deal. Biostatistics 18(2), kxw041. https://doi.org/10.1093/biostatistics/kxw041
Zhu, A., Ibrahim, J. G. and Love, M. I. (2019). Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics 35(12), 2084–2092.