vignettes/udr_intro.Rmd
udr_intro.Rmd
Here we illustrate the features of the Ultimate Deconvolution (UD) package.
We simulate 4,000 data points from a UD model with 4 mixture components: \[ x_i \sim w_1 N(0,V + U_1) + \cdots + w_4 N(0,V + U_4). \]
set.seed(1)
n <- 4000
V <- rbind(c(0.8,0.2),
c(0.2,1.5))
U <- list(none = rbind(c(0,0),
c(0,0)),
shared = rbind(c(1.0,0.9),
c(0.9,1.0)),
only1 = rbind(c(1,0),
c(0,0)),
only2 = rbind(c(0,0),
c(0,1)))
w <- c(0.8,0.1,0.075,0.025)
X <- simulate_ud_data(n,w,U,V)
This is the most basic usage, using the default settings:
Note: in this toy simulation example, we set model parameter \(V\) to \(V\) as we know the simulation truth. In practice, it’s likely one needs to estimate \(V\) and use \(\hat V\) for \(V\).
fit0 <- ud_init(X, V = V)
fit1 <- ud_fit(fit0)
# Performing Ultimate Deconvolution on 4000 x 2 matrix (udr 0.3-144, "R"):
# data points are i.i.d. (same V)
# prior covariances: 2 scaled, 0 rank-1, 8 unconstrained
# prior covariance updates: scaled (fa), rank-1 (none), unconstrained (ted)
# mixture weights update: em
# residual covariance update: none
# max 20 updates, tol=1.0e-06, tol.lik=1.0e-03
# iter log-likelihood log-likelihood.pen |w - w'| |U - U'| |V - V'|
# 1 -1.2279801614499656e+04 -1.2279801614499656e+04 4.68e-02 1.28e+01 0.00e+00
# 2 -1.2223121182887791e+04 -1.2223121182887791e+04 8.81e-03 5.10e-01 0.00e+00
# 3 -1.2212190514096219e+04 -1.2212190514096219e+04 6.38e-03 1.89e-01 0.00e+00
# 4 -1.2208915989941652e+04 -1.2208915989941652e+04 4.86e-03 8.34e-02 0.00e+00
# 5 -1.2207216178055960e+04 -1.2207216178055960e+04 4.05e-03 4.79e-02 0.00e+00
# 6 -1.2206156194755680e+04 -1.2206156194755680e+04 3.50e-03 3.68e-02 0.00e+00
# 7 -1.2205420599827859e+04 -1.2205420599827859e+04 3.08e-03 3.03e-02 0.00e+00
# 8 -1.2204895950208514e+04 -1.2204895950208514e+04 2.74e-03 2.46e-02 0.00e+00
# 9 -1.2204481484151787e+04 -1.2204481484151787e+04 2.48e-03 1.95e-02 0.00e+00
# 10 -1.2204146611350558e+04 -1.2204146611350558e+04 2.26e-03 1.51e-02 0.00e+00
# 11 -1.2203869862004891e+04 -1.2203869862004891e+04 2.07e-03 1.14e-02 0.00e+00
# 12 -1.2203634312857637e+04 -1.2203634312857637e+04 1.91e-03 8.31e-03 0.00e+00
# 13 -1.2203423126633246e+04 -1.2203423126633246e+04 1.78e-03 5.97e-03 0.00e+00
# 14 -1.2203229410058899e+04 -1.2203229410058899e+04 1.67e-03 4.45e-03 0.00e+00
# 15 -1.2203050790905421e+04 -1.2203050790905421e+04 1.59e-03 3.40e-03 0.00e+00
# 16 -1.2202885720507049e+04 -1.2202885720507049e+04 1.51e-03 2.64e-03 0.00e+00
# 17 -1.2202732934351641e+04 -1.2202732934351641e+04 1.43e-03 2.23e-03 0.00e+00
# 18 -1.2202591340657174e+04 -1.2202591340657174e+04 1.37e-03 2.09e-03 0.00e+00
# 19 -1.2202459977936309e+04 -1.2202459977936309e+04 1.30e-03 1.98e-03 0.00e+00
# 20 -1.2202337990145770e+04 -1.2202337990145770e+04 1.24e-03 1.89e-03 0.00e+00
The default model is a mixture with \(K = 10\) mixture components: 2 scaled scaled and 8 unconstrained prior covariance matrices \(U_k\):
summary(fit1)
# Model overview:
# Number of data points: 4000
# Dimension of data points: 2
# Data points are i.i.d. (same V)
# Number of scaled prior covariances: 2
# Number of rank-1 prior covariances: 0
# Number of unconstrained prior covariances: 8
# Evaluation of fit (20 updates performed):
# Log-likelihood: -1.220233799015e+04
The function ud_init
is a flexible interface for
defining a variety of UD models. For example,
fit0 <- ud_init(X,U_scaled = U,n_rank1 = 1,n_unconstrained = 1,V = V)
fit2 <- ud_fit(fit0)
summary(fit2)
# Performing Ultimate Deconvolution on 4000 x 2 matrix (udr 0.3-144, "R"):
# data points are i.i.d. (same V)
# prior covariances: 4 scaled, 1 rank-1, 1 unconstrained
# prior covariance updates: scaled (fa), rank-1 (ted), unconstrained (ted)
# mixture weights update: em
# residual covariance update: none
# max 20 updates, tol=1.0e-06, tol.lik=1.0e-03
# iter log-likelihood log-likelihood.pen |w - w'| |U - U'| |V - V'|
# 1 -1.2254274362870696e+04 -1.2254274362870696e+04 2.15e-02 1.01e+00 0.00e+00
# 2 -1.2234214947219918e+04 -1.2234214947219918e+04 7.12e-03 2.99e-01 0.00e+00
# 3 -1.2224394571088986e+04 -1.2224394571088986e+04 5.32e-03 1.62e-01 0.00e+00
# 4 -1.2219820969075176e+04 -1.2219820969075176e+04 4.73e-03 1.00e-01 0.00e+00
# 5 -1.2216889113988313e+04 -1.2216889113988313e+04 4.15e-03 6.61e-02 0.00e+00
# 6 -1.2214904923232063e+04 -1.2214904923232063e+04 3.61e-03 4.32e-02 0.00e+00
# 7 -1.2213781656532648e+04 -1.2213781656532648e+04 3.12e-03 4.90e-03 0.00e+00
# 8 -1.2212842079018839e+04 -1.2212842079018839e+04 2.83e-03 5.29e-09 0.00e+00
# 9 -1.2211992244080759e+04 -1.2211992244080759e+04 2.66e-03 1.01e-08 0.00e+00
# 10 -1.2211222160321957e+04 -1.2211222160321957e+04 2.53e-03 9.72e-09 0.00e+00
# 11 -1.2210522990459391e+04 -1.2210522990459391e+04 2.41e-03 9.72e-09 0.00e+00
# 12 -1.2209886923631164e+04 -1.2209886923631164e+04 2.30e-03 3.31e-24 0.00e+00
# 13 -1.2209307059168852e+04 -1.2209307059168852e+04 2.19e-03 3.31e-24 0.00e+00
# 14 -1.2208777302101074e+04 -1.2208777302101074e+04 2.09e-03 1.08e-08 0.00e+00
# 15 -1.2208292268713056e+04 -1.2208292268713056e+04 2.07e-03 0.00e+00 0.00e+00
# 16 -1.2207847203500649e+04 -1.2207847203500649e+04 2.06e-03 1.08e-08 0.00e+00
# 17 -1.2207437904917895e+04 -1.2207437904917895e+04 2.04e-03 5.63e-09 0.00e+00
# 18 -1.2207060657974238e+04 -1.2207060657974238e+04 2.02e-03 5.63e-09 0.00e+00
# 19 -1.2206712178043186e+04 -1.2206712178043186e+04 2.00e-03 1.08e-08 0.00e+00
# 20 -1.2206389557276261e+04 -1.2206389557276261e+04 1.97e-03 3.31e-24 0.00e+00
# Model overview:
# Number of data points: 4000
# Dimension of data points: 2
# Data points are i.i.d. (same V)
# Number of scaled prior covariances: 4
# Number of rank-1 prior covariances: 1
# Number of unconstrained prior covariances: 1
# Evaluation of fit (20 updates performed):
# Log-likelihood: -1.220638955728e+04
The ud_fit
interface also allows for flexibly re-fitting
a model; that is, a ud_fit
output can be used in another
call to ud_fit
. There is also a more advanced model fitting
interface for more fine-grained control; see
help(ud_fit_advanced)
for details.
Notice that the data do not need to be passed to ud_fit
because they are already embedded in the ud_fit
output.
However, one can also pass a new data set X
to
ud_fit
.
The control
argument to ud_fit
controls the
optimization of the model parameters. For example, the unconstrained
prior covariance matrices are by default optimized using the “truncated
eigenvalue decomposition” (TED) updates, and we can replace these with
the Extreme Deconvolution (ED) updates:
fit3 <- ud_fit(fit0,control = list(unconstrained.update = "ed"))
# Performing Ultimate Deconvolution on 4000 x 2 matrix (udr 0.3-144, "R"):
# data points are i.i.d. (same V)
# prior covariances: 4 scaled, 1 rank-1, 1 unconstrained
# prior covariance updates: scaled (fa), rank-1 (ted), unconstrained (ed)
# mixture weights update: em
# residual covariance update: none
# max 20 updates, tol=1.0e-06, tol.lik=1.0e-03
# iter log-likelihood log-likelihood.pen |w - w'| |U - U'| |V - V'|
# 1 -1.2287368932550235e+04 -1.2287368932550235e+04 2.15e-02 5.02e-01 0.00e+00
# 2 -1.2267084706127493e+04 -1.2267084706127493e+04 1.19e-02 2.29e-01 0.00e+00
# 3 -1.2254869264065783e+04 -1.2254869264065783e+04 8.55e-03 1.37e-01 0.00e+00
# 4 -1.2246443004794617e+04 -1.2246443004794617e+04 7.70e-03 9.27e-02 0.00e+00
# 5 -1.2240164280190913e+04 -1.2240164280190913e+04 7.05e-03 6.75e-02 0.00e+00
# 6 -1.2235254049624667e+04 -1.2235254049624667e+04 6.52e-03 5.16e-02 0.00e+00
# 7 -1.2231287048520800e+04 -1.2231287048520800e+04 6.07e-03 4.08e-02 0.00e+00
# 8 -1.2228007009908366e+04 -1.2228007009908366e+04 5.68e-03 3.31e-02 0.00e+00
# 9 -1.2225247786758919e+04 -1.2225247786758919e+04 5.33e-03 2.73e-02 0.00e+00
# 10 -1.2222895469731768e+04 -1.2222895469731768e+04 5.01e-03 2.30e-02 0.00e+00
# 11 -1.2220868482802136e+04 -1.2220868482802136e+04 4.72e-03 1.95e-02 0.00e+00
# 12 -1.2219106350217282e+04 -1.2219106350217282e+04 4.46e-03 1.68e-02 0.00e+00
# 13 -1.2217562978153386e+04 -1.2217562978153386e+04 4.22e-03 1.45e-02 0.00e+00
# 14 -1.2216202436938260e+04 -1.2216202436938260e+04 3.99e-03 1.27e-02 0.00e+00
# 15 -1.2214996199525513e+04 -1.2214996199525513e+04 3.78e-03 1.12e-02 0.00e+00
# 16 -1.2213921270641369e+04 -1.2213921270641369e+04 3.59e-03 9.87e-03 0.00e+00
# 17 -1.2212958877040135e+04 -1.2212958877040135e+04 3.41e-03 8.76e-03 0.00e+00
# 18 -1.2212093527740029e+04 -1.2212093527740029e+04 3.24e-03 7.82e-03 0.00e+00
# 19 -1.2211312324993052e+04 -1.2211312324993052e+04 3.08e-03 7.13e-03 0.00e+00
# 20 -1.2210604444683637e+04 -1.2210604444683637e+04 2.93e-03 6.61e-03 0.00e+00
Notice that the likelihood is slightly higher with the TED updates, reflecting the fact that the TED updates have greater freedom to optimize the covariance matrices.
These are the current model fitting defaults:
unlist(ud_fit_control_default())
# weights.update resid.update scaled.update
# NA NA NA
# rank1.update unconstrained.update version
# NA NA "R"
# maxiter minval tol
# "20" "1e-08" "1e-06"
# tol.lik zero.threshold update.threshold
# "0.001" "1e-15" "0.001"
# lambda penalty.type
# "0" "iw"
The most intensive computations are implemented in R
(control$version = R
) and C++
(control$version = Rcpp
); both versions should return the
same, or nearly the same output, but the C++ version can sometimes be
much faster.
In some cases each sample might have different measurement error, and
udr can handle such cases by specifying V
as a list instead
of a matrix:
fit0 <- ud_init(X,V = rep(list(V),n))
fit4 <- ud_fit(fit0)
# Performing Ultimate Deconvolution on 4000 x 2 matrix (udr 0.3-144, "R"):
# data points are not i.i.d. (different Vs)
# prior covariances: 2 scaled, 0 rank-1, 8 unconstrained
# prior covariance updates: scaled (none), rank-1 (none), unconstrained (none)
# mixture weights update: em
# max 20 updates, tol=1.0e-06, tol.lik=1.0e-03
# iter log-likelihood log-likelihood.pen |w - w'| |U - U'| |V - V'|
# 1 -1.3805879530064036e+04 -1.3805879530064036e+04 5.94e-02 0.00e+00 0.00e+00
# 2 -1.3490340592972956e+04 -1.3490340592972956e+04 6.76e-02 0.00e+00 0.00e+00
# 3 -1.3250364428611714e+04 -1.3250364428611714e+04 6.86e-02 0.00e+00 0.00e+00
# 4 -1.3071921980151610e+04 -1.3071921980151610e+04 6.37e-02 0.00e+00 0.00e+00
# 5 -1.2941705117707856e+04 -1.2941705117707856e+04 5.54e-02 0.00e+00 0.00e+00
# 6 -1.2847990772326337e+04 -1.2847990772326337e+04 4.59e-02 0.00e+00 0.00e+00
# 7 -1.2781081111651431e+04 -1.2781081111651431e+04 3.70e-02 0.00e+00 0.00e+00
# 8 -1.2733419541665267e+04 -1.2733419541665267e+04 2.94e-02 0.00e+00 0.00e+00
# 9 -1.2699392526575763e+04 -1.2699392526575763e+04 2.31e-02 0.00e+00 0.00e+00
# 10 -1.2674965124613851e+04 -1.2674965124613851e+04 1.82e-02 0.00e+00 0.00e+00
# 11 -1.2657294509511747e+04 -1.2657294509511747e+04 1.44e-02 0.00e+00 0.00e+00
# 12 -1.2644397982178429e+04 -1.2644397982178429e+04 1.15e-02 0.00e+00 0.00e+00
# 13 -1.2634896984685392e+04 -1.2634896984685392e+04 9.24e-03 0.00e+00 0.00e+00
# 14 -1.2627831240020840e+04 -1.2627831240020840e+04 7.53e-03 0.00e+00 0.00e+00
# 15 -1.2622528327349333e+04 -1.2622528327349333e+04 6.20e-03 0.00e+00 0.00e+00
# 16 -1.2618513854100438e+04 -1.2618513854100438e+04 5.18e-03 0.00e+00 0.00e+00
# 17 -1.2615450149042339e+04 -1.2615450149042339e+04 4.38e-03 0.00e+00 0.00e+00
# 18 -1.2613094535230090e+04 -1.2613094535230090e+04 3.75e-03 0.00e+00 0.00e+00
# 19 -1.2611270880733064e+04 -1.2611270880733064e+04 3.25e-03 0.00e+00 0.00e+00
# 20 -1.2609850106127040e+04 -1.2609850106127040e+04 2.85e-03 0.00e+00 0.00e+00
The TED updates cannot be applied to this setting, so the ED updates must be used instead.
This is the version of R and the packages that were used to generate these results:
sessionInfo()
# R version 4.2.0 (2022-04-22)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: CentOS Linux 7 (Core)
#
# Matrix products: default
# BLAS/LAPACK: /software/openblas-0.3.13-el7-x86_64/lib/libopenblas_haswellp-r0.3.13.so
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
# [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] udr_0.3-147
#
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.8.3 knitr_1.39 magrittr_2.0.3 R6_2.5.1
# [5] ragg_1.2.5 rlang_1.0.2 fastmap_1.1.0 stringr_1.4.0
# [9] tools_4.2.0 xfun_0.30 cli_3.3.0 jquerylib_0.1.4
# [13] htmltools_0.5.2 systemfonts_1.0.4 yaml_2.3.5 digest_0.6.29
# [17] rprojroot_2.0.3 pkgdown_2.0.7 textshaping_0.3.6 purrr_0.3.4
# [21] sass_0.4.1 fs_1.5.2 memoise_2.0.1 cachem_1.0.6
# [25] evaluate_0.15 rmarkdown_2.14 stringi_1.7.6 compiler_4.2.0
# [29] bslib_0.3.1 desc_1.4.1 mvtnorm_1.1-3 jsonlite_1.8.0