R/datasim.R
simulate_gene_data.Rd
Simulate count data from a Poisson NMF model or multinomial topic model, in which topics represent “gene expression programs”, and gene expression programs are characterized by different rates of expression. The way in which the counts are simulated is modeled after gene expression studies in which expression is measured by single-cell RNA sequencing (“RNA-seq”) techniques: each row of the counts matrix corresponds to a gene expression profile, each column corresponds to a gene, and each matrix element is a “read count”, or “UMI count”, measuring expression level. Factors are simulated so as to capture realistic changes in gene expression across different cell types. See “Details” for the procedure used to simulate factors, loadings and counts.
simulate_poisson_gene_data(n, m, k, s, p = 1, sparse = FALSE)
simulate_multinom_gene_data(n, m, k, sparse = FALSE)
Number of rows in the simulated count matrix. Should be at least 2.
Number of columns in the simulated count matrix. Should be at least 2.
Number of factors, or “topics”, used to generate the data. Should be 2 or more.
Vector of “size factors”; each row of the loadings matrix L is scaled by the entries of s before generating the counts. This should be a vector of length n containing only positive values.
Probability that F[i,j] is equal to the mean rate. Smaller values of p will result in more factors that are the same across topics.
If sparse = TRUE, convert the counts matrix to a sparse matrix in compressed, column-oriented format; see sparseMatrix.
simulate_poisson_gene_data returns a list containing the counts matrix X, the size factors s, and the factorization, F and L, used to generate the counts.
simulate_multinom_gene_data returns a list containing the counts matrix X, the mixture proportions L, and the factors F (gene probabilities, or relative gene expression levels) used to generate the counts.
Here we describe the process for generating the n x k loadings matrix L and the m x k factors matrix F.
Each row of the L matrix is generated in the following manner: (1) the number of nonzero mixture proportions, \(n\), satisfies \(1 \le n \le k\) and is drawn with probability proportional to \(2^{-n}\) (here \(n\) denotes this count, not the number of rows); (2) the indices of the nonzero mixture proportions are sampled uniformly at random; and (3) the nonzero mixture proportions are sampled from the Dirichlet distribution with \(\alpha = 1\) (so that all topics are equally likely).
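The three steps for one row of L can be sketched in base R as follows. This is an illustration of the description above, not the package's own code; the helper name sample_loadings_row is hypothetical, and the Dirichlet draw is obtained by normalizing exponential variates, which is a standard construction for \(\alpha = 1\).

```r
# Hypothetical helper: draw one row of the loadings matrix for k topics.
sample_loadings_row <- function (k) {
  # (1) Number of nonzero proportions, with P(n1) proportional to 2^(-n1).
  n1 <- sample.int(k, 1, prob = 2^(-seq_len(k)))
  # (2) Indices of the nonzero proportions, uniformly at random.
  i <- sample.int(k, n1)
  # (3) Dirichlet(1,...,1) draw: normalize independent Exp(1) variates.
  g <- rexp(n1)
  x <- numeric(k)
  x[i] <- g / sum(g)
  x
}

set.seed(1)
# Five rows for k = 4 topics; each row is a vector of mixture proportions.
L <- t(replicate(5, sample_loadings_row(4)))
```

Each row of the resulting matrix is nonnegative and sums to 1, with between 1 and k nonzero entries.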
Each row of the factors matrix is generated according to the following procedure: (1) generate \(u = |r| - 5\), where \(r \sim N(0,2)\); (2) for each topic \(k\), generate the Poisson rates as \(\exp(\max(t,-5))\), where \(t \sim 0.95 \, N(u,s/10) + 0.05 \, N(u,s)\), and \(s = \exp(-u/8)\). Factors can be interpreted as Poisson rates or multinomial probabilities, so that individual counts can be viewed as being generated from a weighted mixture of “topics” with different rates or probabilities.
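A minimal base R sketch of this procedure for one row of F is given below. It rests on several assumptions that the text does not settle: the second parameter of \(N(\cdot,\cdot)\) is read as a standard deviation (as in rnorm), the mixture is read as choosing the narrow component with probability 0.95 and the wide one with probability 0.05, and the argument p is not modeled. The helper name sample_factors_row is hypothetical.

```r
# Hypothetical helper: draw the Poisson rates of one gene across k topics.
sample_factors_row <- function (k) {
  # (1) Baseline log-rate u = |r| - 5, with r ~ N(0,2) (sd = 2 assumed).
  u <- abs(rnorm(1, 0, 2)) - 5
  s <- exp(-u/8)
  # (2) Per-topic log-rates from a two-component normal mixture centered
  # at u: sd s/10 with probability 0.95, sd s with probability 0.05.
  sds <- ifelse(runif(k) < 0.95, s/10, s)
  z <- rnorm(k, u, sds)
  # Truncate the log-rates below at -5, then exponentiate.
  exp(pmax(z, -5))
}

set.seed(1)
# Six genes, k = 3 topics.
F <- t(replicate(6, sample_factors_row(3)))
```

Under this reading, most genes have nearly identical rates across topics, and occasional draws from the wide component produce topic-specific changes in expression.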
Once the loadings and factors have been generated, the counts are simulated from either the Poisson NMF or multinomial topic model: for the former, X[i,j] is Poisson with rate Y[i,j], where Y = tcrossprod(L,F); for the latter, X[i,] is multinomial with size s[i] and with class probabilities P[i,], where P = tcrossprod(L,F). For the multinomial model only, the sizes s are randomly generated as s = 10^rnorm(n,3,0.2).
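The count-generation step can be illustrated in base R with small placeholder matrices. This is a sketch under stated assumptions rather than the package's implementation: the L and F used here are arbitrary stand-ins, the columns of F are normalized to sum to 1 for the multinomial case so that the rows of P are probabilities, and the sizes are rounded up to integers because rmultinom expects an integer size.

```r
set.seed(1)
n <- 4; m <- 6; k <- 2

# Placeholder loadings (rows sum to 1) and positive factors.
L <- matrix(runif(n * k), n, k)
L <- L / rowSums(L)
F <- matrix(rexp(m * k), m, k)

# Poisson NMF: X[i,j] is Poisson with rate Y[i,j], Y = tcrossprod(L,F).
Y <- tcrossprod(L, F)                          # n x m matrix of rates
X_pois <- matrix(rpois(n * m, Y), n, m)

# Multinomial topic model: X[i,] is multinomial with size s[i] and
# class probabilities P[i,], P = tcrossprod(L,Fp).
Fp <- sweep(F, 2, colSums(F), "/")             # assumption: columns sum to 1
P <- tcrossprod(L, Fp)                         # rows of P sum to 1
s <- ceiling(10^rnorm(n, 3, 0.2))              # assumption: rounded up
X_multinom <- t(sapply(seq_len(n),
                       function (i) rmultinom(1, s[i], P[i,])[,1]))
```

Both results are n x m matrices of nonnegative integers; in the multinomial case the row totals equal the sizes s.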
Note that only minimal argument checking is performed; these functions are mainly used to test implementations of the topic-model-based differential count analysis.