Fit a Poisson GLM-PCA model by maximum-likelihood.
fit_glmpca_pois(
Y,
K,
fit0 = init_glmpca_pois(Y, K),
verbose = TRUE,
control = list()
)
fit_glmpca_pois_control_default()
init_glmpca_pois(
Y,
K,
U,
V,
X = numeric(0),
Z = numeric(0),
B = numeric(0),
W = numeric(0),
fixed_b_cols = numeric(0),
fixed_w_cols = numeric(0),
col_size_factor = TRUE,
row_intercept = TRUE
)
The n x m matrix of counts; all entries of Y
should
be non-negative. It can be a sparse matrix (class
"dsparseMatrix"
) or dense matrix (class "matrix"
).
Integer 1 or greater specifying the rank of the matrix
factorization. This should only be provided if the initial fit
(fit0
) is not.
Initial model fit. It should be an object of class
“glmpca_fit_pois”, such as an output from
init_glmpca_pois
or a previous call to
fit_glmpca_pois
.
If verbose = TRUE
, information about the
algorithm's progress is printed after each update.
List of control parameters to modify behavior of the optimization algorithm; see “Details”.
An optional argument giving the initial estimate of the
loadings matrix. It should be an n x K matrix, where n is the
number of rows in the counts matrix Y
, and K > 0 is the rank
of the matrix factorization. When U
and V
are not
provided, input argument K
should be specified instead.
An optional argument giving is the initial estimate of the
factors matrix. It should be a m x K matrix, where m is the number
of columns in the counts matrix Y
, and K > 0 is the rank of
the matrix factorization. When U
and V
are not
provided, input argument K
should be specified instead.
Optional argument giving row covariates of the count
matrix Y
. It should be an n x nx matrix, where nx is
the number of row covariates.
Optional argument giving column covariates of the count
matrix Y
. It should be an m x nz matrix, where nz is the
number of column covariates.
Optional argument giving the initial estimates for the coefficients of the row covariates. It should be an m x nx matrix, where nx is the number of row covariates. This argument is ignored if X is not provided.
Optional argument giving the initial estimates for the coefficients of the column covariates. It should be an n x nz matrix, where nz is the number of column covariates. This argument is ignored if Z is not provided.
Optional numeric vector specifying which
columns of B
(if any) should be fixed during
optimization. This argument is ignored if X is not provided.
Optional numeric vector specifying which
columns of W
(if any) should be fixed during
optimization. This argument is ignored if Z is not provided.
If col_size_factor = TRUE
, add a
fixed factor accounting for average differences in Poisson rates
across columns of Y
. Setting col_size_factor = TRUE
and row_intercept = TRUE
is intended to replicate the
default behavior of glmpca
.
If row_intercept = TRUE
, add a factor
accounting for average differences in Poisson rates across rows of
Y
. Setting col_size_factor = TRUE
and
row_intercept = TRUE
is intended to replicate the default
behavior of glmpca
.
An object capturing the state of the model fit. It contains estimates of \(U\), \(V\) and \(D\) (stored as matrices
U
, V
and a vector of diagonal entries d
,
analogous to the svd
return value); the other
parameters (\(X\), \(B\), \(Z\), \(W\)); the log-likelihood
achieved (loglik
); information about which columns of
\(B\) and \(W\) are fixed (fixed_b_cols
,
fixed_w_cols
); and a data frame progress
storing
information about the algorithm's progress after each update.
In generalized principal component analysis (GLM-PCA) based on a Poisson likelihood, the counts \(y_{ij}\) stored in an \(n \times m\) matrix \(Y\) are modeled as $$y_{ij} \sim Pois(\lambda_{ij}),$$ in which the logarithm of each rate parameter \(\lambda_{ij}\) is defined as a linear combination of rank-K matrices to be estimated from the data: $$\log \lambda_{ij} = (UDV')_{ij},$$ where \(U\) and \(V\) are orthogonal matrices of dimension \(n \times K\) and \(m \times K\), respectively, and \(D\) is a diagonal \(K \times K\) matrix in which the entries along its diagonal are positive and decreasing. \(K\) is a tuning parameter specifying the rank of the matrix factorization. This is the same as the low-rank matrix decomposition underlying PCA (that is, the singular value decomposition), but because we are not using a linear (Gaussian) model, this is called “generalized PCA” or “GLM PCA”.
To allow for additional components that may be fixed,
fit_glmpca_pois
can also fit the more general model
$$\log \lambda_{ij} = (UDV' + XB' + WZ')_{ij},$$ in which
\(X\), \(Z\) are fixed matrices of dimension \(n \times
n_x\) and \(m \times n_z\), respectively, and
\(B\), \(W\) are matrices of dimension \(m \times n_x\) and \(n \times n_z\) to be estimated from the data.
fit_glmpca_pois
computes maximum-likelihood estimates (MLEs)
of \(U\), \(V\), \(D\), \(B\) and \(W\) satistifying the
orthogonality constraints for \(U\) and \(V\) and the
additional constraints on \(D\) that the entries are positive and
decreasing. This is accomplished by iteratively fitting a series of
Poisson GLMs, where each of these individual Poissons GLMs is fitted
using a fast “cyclic co-ordinate descent” (CCD) algorithm.
The control
argument is a list in which any of the following
named components will override the default optimization algorithm
settings (as they are defined by
fit_glmpca_pois_control_default
). Additional control
arguments not listed here can be used to control the behaviour of
fpiter
or daarem
; see
the help accompanying these functions for details.
use_daarem
If use_daarem = TRUE
, the updates
are accelerated using DAAREM; see daarem
for
details.
tol
This is the value of the “tol” control
argument for fpiter
or
daarem
that determines when to stop the
optimization. In brief, the optimization stops when the change in
the estimates or in the log-likelihood between two successive
updates is less than “tol”.
maxiter
This is the value of the “maxiter”
control argument for fpiter
or
daarem
. In brief, it sets the upper limit on
the number of CCD updates.
convtype
This is the value of the “convtype”
control argument for daarem
. It determines
whether the stopping criterion is based on the change in the
estimates or the change in the log-likelihood between two
successive updates.
mon.tol
This is the value of the “mon.tol”
control argument for daarem
. This setting
determines to what extent the monotonicity condition can be
violated.
num_ccd_iter
Number of co-ordinate descent updates to be made to parameters at each iteration of the algorithm.
line_search
If line_search = TRUE
, a
backtracking line search is performed at each iteration of CCD to
guarantee improvement in the objective (the log-likelihood).
ls_alpha
alpha parameter for backtracking line search. (Should be a number between 0 and 0.5, typically a number near zero.)
ls_beta
beta parameter for backtracking line search controlling the rate at which the step size is decreased. (Should be a number between 0 and 0.5.)
calc_deriv
If calc_deriv = TRUE
, the maximum
gradient of \(U\) and \(V\) is calculated and stored after each
update. This may be useful for assessing convergence of the
optimization, though increases overhead.
calc_max_diff
If calc_max_diff = TRUE
, the
largest change in \(U\) and \(V\) after each update is
calculated and stored. This may be useful for monitoring progress
of the optimization algorithm.
orthonormalize
If orthonormalize = TRUE
, the
matrices \(U\) and \(V\) are made to be orthogonal after each
update step. This improves the speed of convergence without the
DAAREM acceleration; however, should not be used when
use_daarem = TRUE
.
You may use function set_fastglmpca_threads
to adjust
the number of threads used in performing the updates.
Townes, F. W., Hicks, S. C., Aryee, M. J. and Irizarry, R. A. (2019). Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biology 20, 295. doi:10.1186/s13059-019-1861-6
Collins, M., Dasgupta, S. and Schapire, R. E. (2002). A generalization of principal components analysis to the exponential family. In Advances in Neural Information Processing Systems 14.
fit_glmpca_pois
set.seed(1)
n <- 200
p <- 100
K <- 3
dat <- generate_glmpca_data_pois(n,p,K)
fit0 <- init_glmpca_pois(dat$Y,K)
fit <- fit_glmpca_pois(dat$Y,fit0 = fit0)
#> Fitting GLM-PCA model to 200 x 100 count matrix.
#> Iteration 1: log-likelihood = -9.150188547370e+04
#> Iteration 2: log-likelihood = -4.053287865351e+04
#> Iteration 3: log-likelihood = -3.613824011415e+04
#> Iteration 4: log-likelihood = -3.544135095378e+04
#> Iteration 5: log-likelihood = -3.524332761304e+04
#> Iteration 6: log-likelihood = -3.516524199745e+04
#> Iteration 7: log-likelihood = -3.512755108135e+04
#> Iteration 8: log-likelihood = -3.510690088392e+04
#> Iteration 9: log-likelihood = -3.509464345815e+04
#> Iteration 10: log-likelihood = -3.508696785545e+04
#> Iteration 11: log-likelihood = -3.508196851573e+04
#> Iteration 12: log-likelihood = -3.507860615708e+04
#> Iteration 13: log-likelihood = -3.507627979214e+04
#> Iteration 14: log-likelihood = -3.507462749014e+04
#> Iteration 15: log-likelihood = -3.507342462213e+04
#> Iteration 16: log-likelihood = -3.507252834177e+04
#> Iteration 17: log-likelihood = -3.507184585954e+04
#> Iteration 18: log-likelihood = -3.507131570026e+04
#> Iteration 19: log-likelihood = -3.507089635393e+04
#> Iteration 20: log-likelihood = -3.507055926109e+04
#> Iteration 21: log-likelihood = -3.507028440341e+04
#> Iteration 22: log-likelihood = -3.507005748682e+04
#> Iteration 23: log-likelihood = -3.506986811658e+04
#> Iteration 24: log-likelihood = -3.506970859857e+04
#> Iteration 25: log-likelihood = -3.506957313873e+04
#> Iteration 26: log-likelihood = -3.506945730176e+04
#> Iteration 27: log-likelihood = -3.506935764094e+04
#> Iteration 28: log-likelihood = -3.506927144065e+04
#> Iteration 29: log-likelihood = -3.506919653339e+04
#> Iteration 30: log-likelihood = -3.506913116934e+04
#> Iteration 31: log-likelihood = -3.506907392091e+04
#> Iteration 32: log-likelihood = -3.506902361308e+04
#> Iteration 33: log-likelihood = -3.506897927068e+04
#> Iteration 34: log-likelihood = -3.506894007835e+04
#> Iteration 35: log-likelihood = -3.506890535058e+04
#> Iteration 36: log-likelihood = -3.506887450650e+04
#> Iteration 37: log-likelihood = -3.506884705228e+04
#> Iteration 38: log-likelihood = -3.506882256615e+04
#> Iteration 39: log-likelihood = -3.506880068601e+04
#> Iteration 40: log-likelihood = -3.506878109978e+04
#> Iteration 41: log-likelihood = -3.506876353786e+04
#> Iteration 42: log-likelihood = -3.506874776663e+04
#> Iteration 43: log-likelihood = -3.506873358261e+04
#> Iteration 44: log-likelihood = -3.506872080834e+04
#> Iteration 45: log-likelihood = -3.506870928857e+04
#> Iteration 46: log-likelihood = -3.506869888712e+04
#> Iteration 47: log-likelihood = -3.506868948441e+04
#> Iteration 48: log-likelihood = -3.506868097488e+04
#> Iteration 49: log-likelihood = -3.506867326566e+04
#> Iteration 50: log-likelihood = -3.506866627429e+04
#> Iteration 51: log-likelihood = -3.506865992786e+04
#> Iteration 52: log-likelihood = -3.506865416168e+04
#> Iteration 53: log-likelihood = -3.506864891821e+04
#> Iteration 54: log-likelihood = -3.506864414612e+04
#> Iteration 55: log-likelihood = -3.506863979961e+04
#> Iteration 56: log-likelihood = -3.506863583749e+04
#> Iteration 57: log-likelihood = -3.506863222343e+04
#> Iteration 58: log-likelihood = -3.506862892469e+04
#> Iteration 59: log-likelihood = -3.506862591170e+04
#> Iteration 60: log-likelihood = -3.506862315804e+04
#> Iteration 61: log-likelihood = -3.506862063999e+04
#> Iteration 62: log-likelihood = -3.506861833595e+04
#> Iteration 63: log-likelihood = -3.506861622658e+04
#> Iteration 64: log-likelihood = -3.506861429453e+04
#> Iteration 65: log-likelihood = -3.506861252396e+04
#> Iteration 66: log-likelihood = -3.506861090065e+04
#> Iteration 67: log-likelihood = -3.506860941160e+04
#> Iteration 68: log-likelihood = -3.506860804513e+04
#> Iteration 69: log-likelihood = -3.506860679059e+04
#> Iteration 70: log-likelihood = -3.506860563833e+04
#> Iteration 71: log-likelihood = -3.506860457963e+04
#> Iteration 72: log-likelihood = -3.506860360654e+04
#> Iteration 73: log-likelihood = -3.506860271175e+04
#> Iteration 74: log-likelihood = -3.506860188870e+04
#> Iteration 75: log-likelihood = -3.506860113136e+04
#> Iteration 76: log-likelihood = -3.506860043428e+04
#> Iteration 77: log-likelihood = -3.506859979246e+04
#> Iteration 78: log-likelihood = -3.506859920134e+04
#> Iteration 79: log-likelihood = -3.506859865677e+04
#> Iteration 80: log-likelihood = -3.506859815491e+04
#> Iteration 81: log-likelihood = -3.506859769230e+04
#> Iteration 82: log-likelihood = -3.506859726579e+04
#> Iteration 83: log-likelihood = -3.506859687240e+04
#> Iteration 84: log-likelihood = -3.506859650954e+04
#> Iteration 85: log-likelihood = -3.506859617475e+04
#> Iteration 86: log-likelihood = -3.506859586570e+04
#> Iteration 87: log-likelihood = -3.506859558047e+04
#> Iteration 88: log-likelihood = -3.506859531707e+04
#> Iteration 89: log-likelihood = -3.506859507379e+04
#> Iteration 90: log-likelihood = -3.506859484907e+04
#> Iteration 91: log-likelihood = -3.506859464145e+04
#> Iteration 92: log-likelihood = -3.506859444962e+04
#> Iteration 93: log-likelihood = -3.506859427229e+04
#> Iteration 94: log-likelihood = -3.506859410832e+04
#> Iteration 95: log-likelihood = -3.506859395670e+04
#> Iteration 96: log-likelihood = -3.506859381645e+04
#> Iteration 97: log-likelihood = -3.506859368671e+04
#> Iteration 98: log-likelihood = -3.506859356667e+04
#> Iteration 99: log-likelihood = -3.506859345558e+04
#> Warning: fit_glmpca_pois failed to meet convergence criterion within 100 iterations