In this vignette, we show that fastTopics can be used to efficiently fit a binomial topic model for binary data. By “binary data”, we mean an $n \times m$ binary data matrix ${\bf X}$ with entries $x_{ij} \in \{0, 1\}$.

The binomial topic model is
$$x_{ij} \sim \mathrm{Binom}(1,\pi_{ij}),$$
in which each binomial probability $\pi_{ij}$ is a linear combination of the parameters $f_{jk}, l_{ik}$,
$$\pi_{ij} = \sum_{k=1}^K l_{ik} f_{jk},$$
such that $l_{ik}$ is the proportion of row $i$ attributed to topic $k$, and $f_{jk}$ is the frequency of ones in column $j$ and topic $k$.

In fastTopics, this binomial model is approximated by a Poisson non-negative matrix factorization (NMF). That is, we make the following approximation:
$$\mathrm{Binom}(1,\pi_{ij}) \approx \mathrm{Pois}(\pi_{ij}).$$
This approximation will be good whenever the sample size is large and the binomial probabilities are small. We illustrate this idea in the example below.
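To get a feel for how the quality of the Poisson approximation depends on the size of the binomial probability, the following base R sketch computes the total variation distance between a Bernoulli distribution and a Poisson distribution with the same mean. The function name `tv_bern_pois` and the grid of probabilities `p` are chosen here for illustration only; they are not part of fastTopics.

```r
# Total variation distance between Binom(1,p) and Pois(p).
# (Illustrative helper; not part of fastTopics.)
tv_bern_pois <- function (p) {
  d0 <- abs(dbinom(0,1,p) - dpois(0,p))   # difference in mass at 0
  d1 <- abs(dbinom(1,1,p) - dpois(1,p))   # difference in mass at 1
  d2 <- ppois(1,p,lower.tail = FALSE)     # Poisson mass on counts >= 2
  return((d0 + d1 + d2)/2)
}

# Evaluate the distance over a range of probabilities.
p  <- c(0.01,0.05,0.1,0.5)
tv <- sapply(p,tv_bern_pois)
print(cbind(p,tv))
```

The distance works out to $p(1 - e^{-p})$, which is roughly $p^2$ for small $p$, so the two distributions become close quickly as the probability shrinks.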

Load the packages and set the seed so the results can be reproduced.

Simulate a 4,000 x 503 (sparse) binary matrix from a binomial topic model with 3 topics.

library(Matrix)      # Provides the "dgCMatrix" sparse matrix class.
library(fastTopics)
set.seed(1)
n <- 1000
m <- 100
L <- rbind(cbind(rep(1,n),rep(0,n),rep(0,n)),
           cbind(rep(0,n),rep(1,n),rep(0,n)),
           cbind(rep(0,n),rep(0,n),rep(1,n)),
           cbind(runif(n),runif(n),runif(n)))
L <- L/rowSums(L)
F <- rbind(diag(3)/3,
           cbind(c(rep(0.1,m),rep(0.1,m),rep(0.05,m),rep(0.0,m),rep(0.0,m)),
                 c(rep(0.0,m),rep(0.0,m),rep(0.05,m),rep(0.1,m),rep(0.0,m)),
                 c(rep(0.1,m),rep(0.0,m),rep(0.05,m),rep(0.0,m),rep(0.1,m))))
P <- L %*% t(F)
n <- nrow(P)
m <- ncol(P)
X <- matrix(rbinom(n*m,1,P),n,m)
X <- as(X,"dgCMatrix")
sim <- list(L = L,F = F,X = X)
mean(X)
# [1] 0.04355716

Fit a Poisson NMF model to the binomial data. To simplify comparison with the true factorization, we “cheat” here and initialize to the true parameter values.

fit_pois <- init_poisson_nmf(X,L = L,F = F)
fit_pois <- fit_poisson_nmf(X,fit0 = fit_pois,
                            control = list(extrapolate = TRUE),
                            verbose = "none")

Convert the Poisson NMF to a binomial topic model without any EM updates to refine the fit. (This step involves a simple rescaling of L and F, and should be very fast.)

fit_binom <- poisson2binom(X,fit_pois,numem = 0)
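The reason a rescaling can be cheap is a generic property of matrix factorizations: multiplying the columns of L by positive constants and dividing the corresponding columns of F by the same constants leaves the product unchanged. The sketch below checks only this identity on small made-up matrices (`L0`, `F0` and `d` are hypothetical); it is not the code used inside `poisson2binom`, whose exact rescaling may differ.

```r
# Small made-up factor matrices (hypothetical values).
L0 <- matrix(1:12/12,4,3)   # 4 x 3
F0 <- matrix(1:15/15,5,3)   # 5 x 3
d  <- c(2,0.5,4)            # One positive scaling constant per topic.

# Multiply column k of L0 by d[k], and divide column k of F0 by d[k].
L1 <- t(t(L0) * d)
F1 <- t(t(F0) / d)

# The two factorizations give the same product, up to rounding error.
err <- max(abs(L0 %*% t(F0) - L1 %*% t(F1)))
print(err)
```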

Perform the conversion a second time, this time with some EM updates to refine the fit.

fit_binom_em <- poisson2binom(X,fit_pois,numem = 20)

The EM updates for fitting the binomial topic model can be very slow for large matrices, so we would like to avoid running them if we can. Since the binomial probabilities are small and the sample size is large in this example, the Poisson NMF model parameters should closely approximate the binomial topic model parameters. Indeed, this is the case:

par(mfrow = c(1,2))
# Estimates without EM go on the x-axis, and estimates with EM on the
# y-axis, to match the axis labels.
plot(fit_binom$F,fit_binom_em$F,pch = 20,xlab = "without EM",
     ylab = "with EM",main = "F")
abline(a = 0,b = 1,col = "darkorange",lty = "dashed")
plot(fit_binom$L,fit_binom_em$L,pch = 20,xlab = "without EM",
     ylab = "with EM",main = "L")
abline(a = 0,b = 1,col = "darkorange",lty = "dashed")
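The scatterplots can be complemented by a numeric summary. The helper below, `compare_estimates`, is a hypothetical convenience function (not part of fastTopics) that reports the largest absolute difference and the correlation between two matrices of the same dimensions. It is shown on small made-up matrices so that it runs on its own.

```r
# Hypothetical helper (not part of fastTopics): summarize the agreement
# between two parameter matrices of identical dimensions.
compare_estimates <- function (A, B) {
  stopifnot(all(dim(A) == dim(B)))
  c(max_abs_diff = max(abs(A - B)),
    correlation  = cor(as.vector(A),as.vector(B)))
}

# Toy example: B is A shifted by a small constant.
A <- matrix(seq(0.01,0.12,length.out = 12),4,3)
B <- A + 0.001
out <- compare_estimates(A,B)
print(out)
```

With the fits from this vignette in the workspace, a call such as `compare_estimates(fit_binom$F,fit_binom_em$F)` would give the corresponding summary for F, assuming both are stored as ordinary matrices of the same size.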

One might be tempted to fit the standard (multinomial) topic model to these data,

fit_multinom <- poisson2multinom(fit_pois)

but this results in estimates that are quite different from the binomial topic model:

par(mfrow = c(1,2))
plot(fit_binom_em$F,fit_multinom$F,pch = 20,
     xlab = "binomial",ylab = "multinomial",main = "F")
abline(a = 0,b = 1,col = "darkorange",lty = "dashed")
plot(fit_binom_em$L,fit_multinom$L,pch = 20,
     xlab = "binomial",ylab = "multinomial",main = "L")
abline(a = 0,b = 1,col = "darkorange",lty = "dashed")

This suggests that the multinomial topic model is the wrong choice when the data are in fact binomial.
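One source of the difference is scale. In the multinomial topic model, the frequencies for each topic form a probability distribution over the columns of the data matrix, so each column of F sums to 1, whereas the binomial frequencies $f_{jk}$ are separate probabilities that need not sum to 1 over $j$. The sketch below rebuilds the true F from the simulation under new names (`m0`, `F_true`) and normalizes its columns to sum to 1. This normalization is shown only to illustrate the constraint; it is not a reproduction of the internals of `poisson2multinom`.

```r
# Rebuild the true binomial frequencies used in the simulation.
m0 <- 100
F_true <- rbind(diag(3)/3,
  cbind(c(rep(0.1,m0),rep(0.1,m0),rep(0.05,m0),rep(0.0,m0),rep(0.0,m0)),
        c(rep(0.0,m0),rep(0.0,m0),rep(0.05,m0),rep(0.1,m0),rep(0.0,m0)),
        c(rep(0.1,m0),rep(0.0,m0),rep(0.05,m0),rep(0.0,m0),rep(0.1,m0))))

# The columns of the binomial F do not sum to 1.
print(colSums(F_true))

# Normalize each column to sum to 1, as a multinomial model requires.
F_norm <- t(t(F_true)/colSums(F_true))

# The normalized frequencies are on a much smaller scale.
print(range(F_true))
print(range(F_norm))
```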