In this vignette, we show that fastTopics can be used to efficiently fit a binomial topic model for binary data. By “binary data”, we mean an $n \times m$ binary data matrix ${\bf X}$ with entries $x_{ij} \in \{0, 1\}$.
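The binomial topic model is
$$x_{ij} \sim \mathrm{Binom}(1, \pi_{ij}),$$
in which each binomial probability $\pi_{ij}$ is a linear combination of the parameters $f_{jk}$,
$$\pi_{ij} = \sum_{k=1}^K l_{ik} f_{jk},$$
such that $l_{ik}$ is the proportion of row $i$ attributed to topic $k$, and $f_{jk}$ is the frequency of ones in column $j$ and topic $k$.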
In fastTopics, this binomial model is approximated by a Poisson non-negative matrix factorization (NMF). That is, we make the following approximation:
$$\mathrm{Binom}(x_{ij}; 1, \pi_{ij}) \approx \mathrm{Pois}(x_{ij}; \pi_{ij}),$$
where $\pi_{ij}$ is the binomial probability for entry $x_{ij}$. This approximation will be good whenever the sample size is large and the binomial probabilities are small. We illustrate this idea in the example below.
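The quality of this approximation can be checked numerically without fastTopics. The sketch below uses only base R; the probabilities in `p` are arbitrary illustrative values, not taken from the example that follows. It compares the Bernoulli and Poisson probabilities of observing 0 and 1, and reports the probability mass that the Poisson places on counts greater than 1, which a binary entry can never take.

```r
# Compare Bernoulli(p) and Poisson(p) probabilities for several
# illustrative values of p (chosen arbitrarily for this sketch).
p <- c(0.001, 0.01, 0.05, 0.1, 0.5)
approx_check <- data.frame(
  p        = p,
  binom_0  = dbinom(0, size = 1, prob = p),  # P(x = 0) under Bernoulli
  pois_0   = dpois(0, lambda = p),           # P(x = 0) under Poisson
  binom_1  = dbinom(1, size = 1, prob = p),  # P(x = 1) under Bernoulli
  pois_1   = dpois(1, lambda = p),           # P(x = 1) under Poisson
  # Mass the Poisson assigns to values a binary entry cannot take.
  pois_gt1 = ppois(1, lambda = p, lower.tail = FALSE)
)
print(approx_check)
```

For small $p$, the excess mass `pois_gt1` is roughly $p^2/2$, so the two distributions are nearly indistinguishable when $p$ is a few percent and diverge noticeably as $p$ approaches 0.5.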
Load the packages and set the seed so the results can be reproduced.
library(Matrix)
library(fastTopics)
set.seed(1)
Simulate a 4,000 x 503 (sparse) binary matrix from a binomial topic model with 3 topics.
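n <- 1000
m <- 100
# Topic proportions: each of the first 3n rows belongs entirely to a
# single topic; the last n rows have random mixed memberships.
L <- rbind(cbind(rep(1,n),rep(0,n),rep(0,n)),
           cbind(rep(0,n),rep(1,n),rep(0,n)),
           cbind(rep(0,n),rep(0,n),rep(1,n)),
           cbind(runif(n),runif(n),runif(n)))
L <- L/rowSums(L)
# Frequencies of ones: 3 columns specific to one topic each, followed
# by 5 blocks of m columns.
F <- rbind(diag(3)/3,
           cbind(c(rep(0.1,m),rep(0.1,m),rep(0.05,m),rep(0.0,m),rep(0.0,m)),
                 c(rep(0.0,m),rep(0.0,m),rep(0.05,m),rep(0.1,m),rep(0.0,m)),
                 c(rep(0.1,m),rep(0.0,m),rep(0.05,m),rep(0.0,m),rep(0.1,m))))
# Binomial probabilities; n and m are then updated to the dimensions
# of the data matrix (4,000 x 503).
P <- L %*% t(F)
n <- nrow(P)
m <- ncol(P)
X <- matrix(rbinom(n*m,1,P),n,m)
X <- as(X,"dgCMatrix")
sim <- list(L = L,F = F,X = X)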
mean(X)
# [1] 0.04355716
Fit a Poisson NMF model to the binomial data. To simplify comparison with the true factorization, we “cheat” here and initialize to the true parameter values.
fit_pois <- init_poisson_nmf(X,L = L,F = F)
fit_pois <- fit_poisson_nmf(X,fit0 = fit_pois,
                            control = list(extrapolate = TRUE),
                            verbose = "none")
Convert the Poisson NMF to a binomial topic model without any EM updates to refine the fit. (This step involves a simple rescaling of L and F, and should be very fast.)
fit_binom <- poisson2binom(X,fit_pois,numem = 0)
Perform the conversion a second time, this time with some EM updates to refine the fit.
fit_binom_em <- poisson2binom(X,fit_pois,numem = 20)
The EM updates for fitting the binomial topic model can be very slow for large matrices, so we would like to avoid running them if we can. Since the binomial probabilities are small and the sample size is large in this example, the Poisson NMF model parameters should closely approximate the binomial topic model parameters. Indeed, this is the case:
par(mfrow = c(1,2))
# x-axis: estimates without EM (numem = 0);
# y-axis: estimates after 20 EM updates.
plot(fit_binom$F,fit_binom_em$F,pch = 20,xlab = "without EM",
     ylab = "with EM",main = "F")
abline(a = 0,b = 1,col = "darkorange",lty = "dashed")
plot(fit_binom$L,fit_binom_em$L,pch = 20,xlab = "without EM",
     ylab = "with EM",main = "L")
abline(a = 0,b = 1,col = "darkorange",lty = "dashed")
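The visual comparison in the scatterplots can be complemented by a numerical summary. The following is a minimal sketch in base R; `compare_estimates` is a hypothetical helper written for this illustration (it is not part of fastTopics), and the matrices `A` and `B` are toy stand-ins rather than output from the fits above.

```r
# Hypothetical helper (not a fastTopics function): summarize the
# agreement between two parameter matrices of the same dimensions.
compare_estimates <- function (A, B) {
  stopifnot(all(dim(A) == dim(B)))
  c(max_abs_diff = max(abs(A - B)),
    correlation  = cor(as.vector(A), as.vector(B)))
}

# Toy stand-ins for two estimates of F; B is A plus small noise.
set.seed(1)
A <- matrix(runif(30, 0, 0.1), 10, 3)
B <- A + matrix(rnorm(30, sd = 1e-4), 10, 3)
print(compare_estimates(A, B))
```

Assuming the `F` and `L` elements of the two fits are numeric matrices of matching dimensions, as the plotting code suggests, the same helper could be applied as `compare_estimates(fit_binom$F, fit_binom_em$F)`.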
One might be tempted to fit the standard (multinomial) topic model to these data,
fit_multinom <- poisson2multinom(fit_pois)
but this results in estimates that are quite different from the binomial topic model:
par(mfrow = c(1,2))
plot(fit_binom_em$F,fit_multinom$F,pch = 20,
     xlab = "binomial",ylab = "multinomial",main = "F")
abline(a = 0,b = 1,col = "darkorange",lty = "dashed")
plot(fit_binom_em$L,fit_multinom$L,pch = 20,
     xlab = "binomial",ylab = "multinomial",main = "L")
abline(a = 0,b = 1,col = "darkorange",lty = "dashed")
This suggests that using the multinomial topic model is the wrong thing to do if the data are indeed binomial.
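One likely contributor to the difference in F is a difference in scale. The sketch below assumes that the multinomial parameterization constrains each column of F to sum to 1, whereas the binomial frequencies are probabilities in $[0, 1]$ with no such constraint. The matrix `F_binom` is a toy example constructed for this illustration, not output from fastTopics.

```r
# Toy binomial-style frequencies: each entry is a probability of
# observing a one, and the columns need not sum to 1.
F_binom <- cbind(k1 = c(0.10, 0.10, 0.05, 0.00),
                 k2 = c(0.00, 0.05, 0.10, 0.10))

# Multinomial-style frequencies under the assumption stated above:
# rescale each column so that it sums to 1.
F_multinom <- sweep(F_binom, 2, colSums(F_binom), "/")

print(colSums(F_binom))
print(colSums(F_multinom))
print(F_multinom)
```

Under this assumption the two sets of parameters answer different questions: a binomial frequency is the chance that a given column is 1 in a topic, while a multinomial frequency is the share of a topic's ones that fall in that column.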