Simulate Gene Expression Data from Poisson NMF or Multinomial Topic Model

Simulate count data from a Poisson NMF model or multinomial topic model, in which topics represent “gene expression programs”, and gene expression programs are characterized by different rates of expression. The way in which the counts are simulated is modeled after gene expression studies in which expression is measured by single-cell RNA sequencing (“RNA-seq”) techniques: each row of the counts matrix corresponds to a gene expression profile, each column corresponds to a gene, and each matrix element is a “read count”, or “UMI count”, measuring expression level. Factors are simulated so as to capture realistic changes in gene expression across different cell types. See “Details” for the procedure used to simulate factors, loadings and counts.

simulate_poisson_gene_data(n, m, k, s, p = 1, sparse = FALSE)

simulate_multinom_gene_data(n, m, k, sparse = FALSE)

Arguments

n: Number of rows in the simulated count matrix. Should be at least 2.
m: Number of columns in the simulated count matrix. Should be at least 2.
k: Number of factors, or “topics”, used to generate the data. Should be 2 or more.
s: Vector of “size factors”; row i of the loadings matrix L is scaled by s[i] before generating the counts. This should be a vector of length n containing only positive values.
p: Probability that F[i,j] differs from the mean rate. Smaller values of p will result in more factors that are the same across topics.
sparse: If sparse = TRUE, convert the counts matrix to a sparse matrix in compressed, column-oriented format; see sparseMatrix.

Value

simulate_poisson_gene_data returns a list containing the counts matrix X, the size factors s, and the factorization (F and L) used to generate the counts. simulate_multinom_gene_data returns a list containing the counts matrix X, and the mixture proportions L and factors (gene probabilities, or relative gene expression levels) F used to generate the counts.

Details

Here we describe the process for generating the n x k loadings matrix L and the m x k factors matrix F.

Each row of the L matrix is generated in the following manner: (1) the number of nonzero mixture proportions is \(1 \le n \le k\), with probability proportional to \(2^{-n}\) (in this step only, \(n\) denotes this number, not the number of rows of the counts matrix); (2) the indices of the nonzero mixture proportions are sampled uniformly at random; and (3) the nonzero mixture proportions are sampled from the Dirichlet distribution with \(\alpha = 1\) (so that all topics are equally likely).
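The three steps above can be sketched in base R as follows. This is only an illustration of the described procedure, not the package's own code: simulate_loadings_sketch and the variable names nnz, idx and g are hypothetical, and the Dirichlet draw is obtained here by normalizing independent Gamma(1) variates.

```r
# Sketch only: simulate_loadings_sketch is a hypothetical helper,
# not a function exported by the package.
simulate_loadings_sketch <- function (n, k) {
  L <- matrix(0, n, k)
  for (i in 1:n) {
    # (1) Number of nonzero proportions, drawn from 1..k with
    #     probability proportional to 2^(-nnz).
    nnz <- sample(k, 1, prob = 2^(-(1:k)))
    # (2) Indices of the nonzero proportions, uniformly at random.
    idx <- sample(k, nnz)
    # (3) Dirichlet(1,...,1) draw: normalize independent Gamma(1) variates.
    g <- rgamma(nnz, shape = 1)
    L[i, idx] <- g / sum(g)
  }
  L
}

set.seed(1)
L <- simulate_loadings_sketch(5, 4)
```

Each row of the resulting matrix sums to 1, and most rows have only one or two nonzero entries because small counts are favored in step (1).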

Each row of the factors matrix is generated according to the following procedure: (1) generate \(u = |r| - 5\), where \(r \sim N(0,2)\); (2) for each topic \(k\), generate the Poisson rates as \(\exp(\max(t,-5))\), where \(t \sim 0.95 \times N(u,s/10) + 0.05 \times N(u,s)\), and \(s = \exp(-u/8)\). Factors can be interpreted as Poisson rates or multinomial probabilities, so that individual counts can be viewed as being generated from a weighted mixture of “topics” with different rates or probabilities.
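A minimal base R sketch of this procedure is given below. It rests on two assumptions that the text does not settle: the second argument of each normal distribution is taken to be a standard deviation, and the expression for t is read as a two-component mixture (with probability 0.95 the narrow component is used, otherwise the wide one). The function name simulate_rates_sketch and the variables sds and logrates are hypothetical.

```r
# Sketch only: simulate_rates_sketch is a hypothetical helper,
# not a function exported by the package.
simulate_rates_sketch <- function (m, k) {
  F <- matrix(0, m, k)
  for (j in 1:m) {
    # (1) Mean log-rate for this gene: u = |r| - 5, r ~ N(0,2).
    #     Assumption: 2 is the standard deviation.
    u <- abs(rnorm(1, 0, 2)) - 5
    s <- exp(-u/8)
    # (2) Mixture of two normals centered at u: sd s/10 with
    #     probability 0.95, sd s with probability 0.05.
    sds <- ifelse(runif(k) < 0.95, s/10, s)
    logrates <- rnorm(k, u, sds)
    # Log-rates are floored at -5 before exponentiating.
    F[j, ] <- exp(pmax(logrates, -5))
  }
  F
}

set.seed(1)
F <- simulate_rates_sketch(10, 3)
```

Under this reading, most topics share nearly the same rate for a given gene, and an occasional topic departs strongly from it.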

Once the loadings and factors have been generated, the counts are simulated from either the Poisson NMF or multinomial topic model: for the former, X[i,j] is Poisson with rate Y[i,j], where Y = tcrossprod(L,F); for the latter, X[i,] is multinomial with size s[i] and with class probabilities P[i,], where P = tcrossprod(L,F). For the multinomial model only, the sizes s are randomly generated as s = 10^rnorm(n,3,0.2).
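As a rough illustration of this last step, the sketch below draws counts from both models in base R. The toy L and F matrices, the constant size factors, and the variable names are made up for the example. Two further assumptions are mine rather than the source's: the columns of F are normalized to sum to 1 so that each row of P is a probability vector, and the multinomial sizes are rounded up to integers.

```r
set.seed(1)
n <- 4; m <- 6; k <- 2

# Toy loadings (rows sum to 1) and factors, for illustration only.
L <- matrix(runif(n * k), n, k)
L <- L / rowSums(L)
F <- matrix(runif(m * k), m, k)

# Poisson NMF: X[i,j] ~ Poisson(Y[i,j]), with Y = tcrossprod(L, F)
# after row i of L has been scaled by s[i].
s <- rep(100, n)                    # hypothetical size factors
Y <- tcrossprod(s * L, F)
X_pois <- matrix(rpois(n * m, Y), n, m)

# Multinomial topic model: X[i,] ~ Multinomial(sizes[i], P[i,]).
# Assumption: columns of F are normalized to sum to 1.
Fp <- t(t(F) / colSums(F))
P <- tcrossprod(L, Fp)
# Assumption: sizes are rounded up so that they are integers.
sizes <- ceiling(10^rnorm(n, 3, 0.2))
X_mult <- t(sapply(1:n, function (i) rmultinom(1, sizes[i], P[i, ])))
```

In the multinomial case the row sums of the counts equal the sampled sizes exactly, whereas in the Poisson case the row totals only match the size factors in expectation when the columns of F sum to 1.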

Note that only minimal argument checking is performed; these functions are mainly used to test the implementation of the topic-model-based differential count analysis.