R/fit_topic_model.R
fit_topic_model.Rd
Fits a multinomial topic model to the count data,
hiding most of the complexities of model fitting. The default
optimization settings used here are intended to work well in a wide
range of data sets, although some fine-tuning may be needed for
more difficult cases. For full control, use fit_poisson_nmf
.
fit_topic_model(
X,
k,
numiter.main = 100,
numiter.refine = 100,
method.main = "em",
method.refine = "scd",
init.method = c("topicscore", "random"),
control.init = list(),
control.main = list(numiter = 4),
control.refine = list(numiter = 4, extrapolate = TRUE),
verbose = c("progressbar", "detailed", "none")
)
The n x m matrix of counts; all entries of X should be
non-negative. It can be a sparse matrix (class "dgCMatrix"
)
or dense matrix (class "matrix"
).
The number of topics. Must be 2 or greater.
Number of updates of the factors and loadings to perform in the main model fitting step. Should be 1 or more.
Number of updates of the factors and loadings to perform in the model refinement step.
The method to use for updating the factors and
loadings in the main model fitting step. Passed as argument
"method" to fit_poisson_nmf
.
The method to use for updating the factors in
evthe model refinement step. Passed as argument "method"
to fit_poisson_nmf
.
The method used to initialize the factors and
loadings. See init_poisson_nmf
for details.
A list of parameters controlling the behaviour
of the optimization and Topic SCORE method in the call to
init_poisson_nmf
. This is passed as argument "control" to
init_poisson_nmf
.
A list of parameters controlling the behaviour
of the optimization in the main model fitting step. This is passed
as argument "control" to fit_poisson_nmf
.
A list of parameters controlling the
behaviour of the of the optimization algorithm in the model
refinement step. This is passed as argument "control" to
fit_poisson_nmf
.
When verbose = "progressbar"
or verbose
= "detailed"
, information about the progress of the model fitting
is printed to the console. See fit_poisson_nmf
for
more information.
A multinomial topic model fit; see
poisson2multinom
and fit_poisson_nmf
for details. Note that outputted likelihoods and deviances in
progress
are evaluated with respect to the equivalent
Poisson NMF model.
With the default settings, the model fitting is
accomplished in four steps: (1) initialize the Poisson NMF model
fit (init_poisson_nmf
); (2) perform the main model fitting
step by running 100 EM updates using fit_poisson_nmf
; (3)
refine the fit by running 100 extrapolated SCD updates, again using
fit_poisson_nmf
; and (4) recover the multinomial topic model
by calling poisson2multinom
.
This two-stage fitting approach is based on our findings that the EM algorithm initially makes rapid progress toward a solution, but its convergence slows considerably as the iterates approach a solution. Close to a solution, we have found that other algorithms make much more rapid progress than EM; in particularly, we found that the extrapolated SCD updates usually performed best). For larger data sets, more updates in the main model fitting and refinement steps may be needed to obtain a good fit.
For larger data sets, more than 200 updates may be needed to obtain a good fit.
Dey, K. K., Hsiao, C. J. and Stephens, M. (2017). Visualizing the structure of RNA-seq expression data using grade of membership models. PLoS Genetics 13, e1006599.
Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993-1022.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd International ACM SIGIR Conference, 50-57. doi:10.1145/312624.312649
library(Matrix)
set.seed(1)
X <- simulate_count_data(80,100,k = 3,sparse = TRUE)$X
fit <- fit_topic_model(X,k = 3)
#> Initializing factors using Topic SCORE algorithm.
#> Initializing loadings by running 10 SCD updates.
#> Fitting rank-3 Poisson NMF to 80 x 100 sparse matrix.
#> Running at most 100 EM updates, without extrapolation (fastTopics 0.6-175).
#>
#> [=======================>-----------------------------------------------] 34%
#>
#> [========================>----------------------------------------------] 35%
#>
#> [=========================>---------------------------------------------] 36%
#>
#> [=========================>---------------------------------------------] 37%
#>
#> [==========================>--------------------------------------------] 38%
#>
#> [===========================>-------------------------------------------] 39%
#>
#> [===========================>-------------------------------------------] 40%
#>
#> [============================>------------------------------------------] 41%
#>
#> [=============================>-----------------------------------------] 42%
#>
#> [==============================>----------------------------------------] 43%
#>
#> [==============================>----------------------------------------] 44%
#>
#> [===============================>---------------------------------------] 45%
#>
#> [================================>--------------------------------------] 46%
#>
#> [================================>--------------------------------------] 47%
#>
#> [=================================>-------------------------------------] 48%
#>
#> [==================================>------------------------------------] 49%
#>
#> [===================================>-----------------------------------] 50%
#>
#> [===================================>-----------------------------------] 51%
#>
#> [====================================>----------------------------------] 52%
#>
#> [=====================================>---------------------------------] 53%
#>
#> [=====================================>---------------------------------] 54%
#>
#> [======================================>--------------------------------] 55%
#>
#> [=======================================>-------------------------------] 56%
#>
#> [=======================================>-------------------------------] 57%
#>
#> [========================================>------------------------------] 58%
#>
#> [=========================================>-----------------------------] 59%
#>
#> [==========================================>----------------------------] 60%
#>
#> [==========================================>----------------------------] 61%
#>
#> [===========================================>---------------------------] 62%
#>
#> [============================================>--------------------------] 63%
#>
#> [============================================>--------------------------] 64%
#>
#> [=============================================>-------------------------] 65%
#>
#> [==============================================>------------------------] 66%
#>
#> [===============================================>-----------------------] 67%
#>
#> [===============================================>-----------------------] 68%
#>
#> [================================================>----------------------] 69%
#>
#> [=================================================>---------------------] 70%
#>
#> [=================================================>---------------------] 71%
#>
#> [==================================================>--------------------] 72%
#>
#> [===================================================>-------------------] 73%
#>
#> [====================================================>------------------] 74%
#>
#> [====================================================>------------------] 75%
#>
#> [=====================================================>-----------------] 76%
#>
#> [======================================================>----------------] 77%
#>
#> [======================================================>----------------] 78%
#>
#> [=======================================================>---------------] 79%
#>
#> [========================================================>--------------] 80%
#>
#> [=========================================================>-------------] 81%
#>
#> [=========================================================>-------------] 82%
#>
#> [==========================================================>------------] 83%
#>
#> [===========================================================>-----------] 84%
#>
#> [===========================================================>-----------] 85%
#>
#> [============================================================>----------] 86%
#>
#> [=============================================================>---------] 87%
#>
#> [=============================================================>---------] 88%
#>
#> [==============================================================>--------] 89%
#>
#> [===============================================================>-------] 90%
#>
#> [================================================================>------] 91%
#>
#> [================================================================>------] 92%
#>
#> [=================================================================>-----] 93%
#>
#> [==================================================================>----] 94%
#>
#> [==================================================================>----] 95%
#>
#> [===================================================================>---] 96%
#>
#> [====================================================================>--] 97%
#>
#> [=====================================================================>-] 98%
#>
#> [=====================================================================>-] 99%
#>
#> [=======================================================================] 100%
#>
#>
#> Refining model fit.
#> Fitting rank-3 Poisson NMF to 80 x 100 sparse matrix.
#> Running at most 100 SCD updates, with extrapolation (fastTopics 0.6-175).
#>
#> [===================>---------------------------------------------------] 28%
#>
#> [====================>--------------------------------------------------] 29%
#>
#> [====================>--------------------------------------------------] 30%
#>
#> [=====================>-------------------------------------------------] 31%
#>
#> [======================>------------------------------------------------] 32%
#>
#> [======================>------------------------------------------------] 33%
#>
#> [=======================>-----------------------------------------------] 34%
#>
#> [========================>----------------------------------------------] 35%
#>
#> [=========================>---------------------------------------------] 36%
#>
#> [=========================>---------------------------------------------] 37%
#>
#> [==========================>--------------------------------------------] 38%
#>
#> [===========================>-------------------------------------------] 39%
#>
#> [===========================>-------------------------------------------] 40%
#>
#> [============================>------------------------------------------] 41%
#>
#> [=============================>-----------------------------------------] 42%
#>
#> [==============================>----------------------------------------] 43%
#>
#> [==============================>----------------------------------------] 44%
#>
#> [===============================>---------------------------------------] 45%
#>
#> [================================>--------------------------------------] 46%
#>
#> [================================>--------------------------------------] 47%
#>
#> [=================================>-------------------------------------] 48%
#>
#> [==================================>------------------------------------] 49%
#>
#> [===================================>-----------------------------------] 50%
#>
#> [===================================>-----------------------------------] 51%
#>
#> [====================================>----------------------------------] 52%
#>
#> [=====================================>---------------------------------] 53%
#>
#> [=====================================>---------------------------------] 54%
#>
#> [======================================>--------------------------------] 55%
#>
#> [=======================================>-------------------------------] 56%
#>
#> [=======================================>-------------------------------] 57%
#>
#> [========================================>------------------------------] 58%
#>
#> [=========================================>-----------------------------] 59%
#>
#> [==========================================>----------------------------] 60%
#>
#> [==========================================>----------------------------] 61%
#>
#> [===========================================>---------------------------] 62%
#>
#> [============================================>--------------------------] 63%
#>
#> [============================================>--------------------------] 64%
#>
#> [=============================================>-------------------------] 65%
#>
#> [==============================================>------------------------] 66%
#>
#> [===============================================>-----------------------] 67%
#>
#> [===============================================>-----------------------] 68%
#>
#> [================================================>----------------------] 69%
#>
#> [=================================================>---------------------] 70%
#>
#> [=================================================>---------------------] 71%
#>
#> [==================================================>--------------------] 72%
#>
#> [===================================================>-------------------] 73%
#>
#> [====================================================>------------------] 74%
#>
#> [====================================================>------------------] 75%
#>
#> [=====================================================>-----------------] 76%
#>
#> [======================================================>----------------] 77%
#>
#> [======================================================>----------------] 78%
#>
#> [=======================================================>---------------] 79%
#>
#> [========================================================>--------------] 80%
#>
#> [=========================================================>-------------] 81%
#>
#> [=========================================================>-------------] 82%
#>
#> [==========================================================>------------] 83%
#>
#> [===========================================================>-----------] 84%
#>
#> [===========================================================>-----------] 85%
#>
#> [============================================================>----------] 86%
#>
#> [=============================================================>---------] 87%
#>
#> [=============================================================>---------] 88%
#>
#> [==============================================================>--------] 89%
#>
#> [===============================================================>-------] 90%
#>
#> [================================================================>------] 91%
#>
#> [================================================================>------] 92%
#>
#> [=================================================================>-----] 93%
#>
#> [==================================================================>----] 94%
#>
#> [==================================================================>----] 95%
#>
#> [===================================================================>---] 96%
#>
#> [====================================================================>--] 97%
#>
#> [=====================================================================>-] 98%
#>
#> [=====================================================================>-] 99%
#>
#> [=======================================================================] 100%
#>
#>
print(summary(fit))
#> Model overview:
#> Number of data rows, n: 80
#> Number of data cols, m: 100
#> Rank/number of topics, k: 3
#> Evaluation of model fit (200 updates performed):
#> Poisson NMF log-likelihood: -8.311490191230e+03
#> Multinomial topic model log-likelihood: -8.068320081877e+03
#> Poisson NMF deviance: +7.710268688987e+03
#> Max KKT residual: +2.984697e-07
#> Set show.size.factors = TRUE, show.mixprops = TRUE and/or show.topic.reps = TRUE in print(...) for more information