embeddings_from_topics.Rd (source: R/embeddings.R)
Description

Lightweight interface for rapidly producing low-dimensional embeddings from matrix factorizations or multinomial topic models. The defaults differ from those of the underlying functions (prcomp, Rtsne and umap) and are chosen to be better suited to the loadings of matrix factorizations and topic models; see the sketch after the usage listing.
Usage

pca_from_topics(fit, dims = 2, center = TRUE, scale. = FALSE, ...)

tsne_from_topics(
  fit,
  dims = 2,
  pca = FALSE,
  normalize = FALSE,
  perplexity = 100,
  theta = 0.1,
  max_iter = 1000,
  eta = 200,
  check_duplicates = FALSE,
  verbose = TRUE,
  ...
)

umap_from_topics(
  fit,
  dims = 2,
  n_neighbors = 30,
  metric = "euclidean",
  scale = "none",
  pca = NULL,
  verbose = TRUE,
  ...
)
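
To make the argument passing concrete, here is a minimal sketch of how pca_from_topics relates to prcomp. It is pieced together from the argument descriptions below (the loadings are assumed to live in fit$L), and is an illustration, not the package's actual implementation:

# Illustrative sketch only: pca_from_topics(fit, dims = d) behaves
# roughly like running prcomp on the loadings matrix fit$L and
# keeping the first d principal components.
pca_sketch <- function (fit, dims = 2, center = TRUE, scale. = FALSE)
  prcomp(fit$L, center = center, scale. = scale., rank. = dims)$x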
Arguments

fit: An object of class "poisson_nmf_fit" or "multinom_topic_model_fit".

dims: The number of dimensions in the embedding. In tsne_from_topics, this is passed as argument "dims" to Rtsne. In umap_from_topics, this is passed as argument "n_components" to umap.

center: A logical value indicating whether columns of fit$L should be zero-centered before performing PCA; passed as argument "center" to prcomp.

scale.: A logical value indicating whether columns of fit$L should be scaled to have unit variance prior to performing PCA; passed as argument "scale." to prcomp.

pca: Whether to perform a PCA processing step in t-SNE or UMAP; passed as argument "pca" to Rtsne or umap.

normalize: Whether to normalize the data prior to running t-SNE; passed as argument "normalize" to Rtsne.

perplexity: t-SNE perplexity parameter; passed as argument "perplexity" to Rtsne. The perplexity is automatically revised if it is too large; see Rtsne for more information.

theta: t-SNE speed/accuracy trade-off parameter; passed as argument "theta" to Rtsne.

max_iter: Maximum number of t-SNE iterations; passed as argument "max_iter" to Rtsne.

eta: t-SNE learning rate parameter; passed as argument "eta" to Rtsne.

check_duplicates: When check_duplicates = TRUE, checks whether there are duplicate rows in fit$L; passed as argument "check_duplicates" to Rtsne.

verbose: If verbose = TRUE, progress updates are printed; passed as argument "verbose" to Rtsne or umap.

n_neighbors: Number of nearest neighbours used in the manifold approximation; passed as argument "n_neighbors" to umap.

metric: Distance metric used to find nearest neighbors; passed as argument "metric" to umap.

scale: Scaling to apply to fit$L; passed as argument "scale" to umap.
Value

An n x d matrix containing the embedding, where n is the number of rows of fit$L and d = dims.
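
For example, a quick check of the return shape using the PBMC data from the examples below:

data(pbmc_facs)
Y <- pca_from_topics(pbmc_facs$fit, dims = 4)
dim(Y)  # 3774 rows (one per row of fit$L) and dims = 4 columns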
Note

Because tsne_from_topics and umap_from_topics apply nonlinear transformations to the data, distances between points in these embeddings are generally less interpretable than distances in a linear embedding such as the one produced by PCA.
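
One way to see this empirically is to compare pairwise distances in the loadings with pairwise distances in each embedding. The snippet below is an illustrative sketch using the PBMC data, not a rigorous benchmark:

set.seed(1)
data(pbmc_facs)
fit <- pbmc_facs$fit
i   <- sample(nrow(fit$L), 500)  # subsample the rows for speed
d0  <- dist(fit$L[i, ])
# Distances after PCA (a linear map) typically track d0 more closely
# than distances after t-SNE (a nonlinear map).
cor(c(d0), c(dist(pca_from_topics(fit)[i, ])))
cor(c(d0), c(dist(tsne_from_topics(fit, verbose = FALSE)[i, ])))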
References

Kobak, D. and Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications 10, 5416. doi:10.1038/s41467-019-13056-x
Examples

library(ggplot2)
library(cowplot)
set.seed(1)
data(pbmc_facs)
# Get the Poisson NMF and multinomial topic model fits to the PBMC data.
# (The stored fit is the multinomial topic model; multinom2poisson
# recovers the corresponding Poisson NMF fit.)
fit1 <- multinom2poisson(pbmc_facs$fit)
fit2 <- pbmc_facs$fit
# Compute the first two PCs of the loadings matrix (for the topic
# model, fit2, the loadings are the topic proportions).
Y1 <- pca_from_topics(fit1)
Y2 <- pca_from_topics(fit2)
subpop <- pbmc_facs$samples$subpop
# quickplot (an alias of qplot) was deprecated in ggplot2 3.4.0, so we
# build the scatterplots with ggplot directly.
plot_embedding <- function (Y)
  ggplot(data.frame(d1 = Y[,1], d2 = Y[,2], subpop = subpop),
         aes(x = d1, y = d2, color = subpop)) +
  geom_point() +
  theme_cowplot()
plot_embedding(Y1)
plot_embedding(Y2)
# Compute a 2-d embedding of the loadings using t-SNE.
# \donttest{
Y1 <- tsne_from_topics(fit1)
#> Read the 3774 x 6 data matrix successfully!
#> Using no_dims = 2, perplexity = 100.000000, and theta = 0.100000
#> Computing input similarities...
#> Building tree...
#> Done in 0.97 seconds (sparsity = 0.103577)!
#> Learning embedding...
#> Iteration 50: error is 67.907680 (50 iterations in 2.31 seconds)
#> Iteration 100: error is 55.093036 (50 iterations in 1.63 seconds)
#> Iteration 150: error is 53.522705 (50 iterations in 1.56 seconds)
#> Iteration 200: error is 52.982723 (50 iterations in 1.54 seconds)
#> Iteration 250: error is 52.700813 (50 iterations in 1.54 seconds)
#> Iteration 300: error is 0.869868 (50 iterations in 1.89 seconds)
#> Iteration 350: error is 0.639018 (50 iterations in 2.04 seconds)
#> Iteration 400: error is 0.537143 (50 iterations in 2.01 seconds)
#> Iteration 450: error is 0.482721 (50 iterations in 2.01 seconds)
#> Iteration 500: error is 0.449646 (50 iterations in 1.98 seconds)
#> Iteration 550: error is 0.427580 (50 iterations in 1.97 seconds)
#> Iteration 600: error is 0.412004 (50 iterations in 1.95 seconds)
#> Iteration 650: error is 0.400458 (50 iterations in 1.93 seconds)
#> Iteration 700: error is 0.391642 (50 iterations in 1.95 seconds)
#> Iteration 750: error is 0.384759 (50 iterations in 1.96 seconds)
#> Iteration 800: error is 0.379201 (50 iterations in 1.97 seconds)
#> Iteration 850: error is 0.374679 (50 iterations in 1.97 seconds)
#> Iteration 900: error is 0.370959 (50 iterations in 1.98 seconds)
#> Iteration 950: error is 0.367860 (50 iterations in 2.10 seconds)
#> Iteration 1000: error is 0.365237 (50 iterations in 1.97 seconds)
#> Fitting performed in 38.27 seconds.
Y2 <- tsne_from_topics(fit2)
#> Read the 3774 x 6 data matrix successfully!
#> Using no_dims = 2, perplexity = 100.000000, and theta = 0.100000
#> Computing input similarities...
#> Building tree...
#> Done in 0.98 seconds (sparsity = 0.101133)!
#> Learning embedding...
#> Iteration 50: error is 65.682366 (50 iterations in 2.66 seconds)
#> Iteration 100: error is 53.576821 (50 iterations in 1.66 seconds)
#> Iteration 150: error is 51.735807 (50 iterations in 1.52 seconds)
#> Iteration 200: error is 51.078410 (50 iterations in 1.45 seconds)
#> Iteration 250: error is 50.736216 (50 iterations in 1.43 seconds)
#> Iteration 300: error is 0.804715 (50 iterations in 1.77 seconds)
#> Iteration 350: error is 0.580748 (50 iterations in 1.85 seconds)
#> Iteration 400: error is 0.481837 (50 iterations in 1.83 seconds)
#> Iteration 450: error is 0.429849 (50 iterations in 1.81 seconds)
#> Iteration 500: error is 0.398384 (50 iterations in 1.81 seconds)
#> Iteration 550: error is 0.377602 (50 iterations in 1.81 seconds)
#> Iteration 600: error is 0.362985 (50 iterations in 1.81 seconds)
#> Iteration 650: error is 0.352226 (50 iterations in 1.81 seconds)
#> Iteration 700: error is 0.344063 (50 iterations in 1.77 seconds)
#> Iteration 750: error is 0.337688 (50 iterations in 1.75 seconds)
#> Iteration 800: error is 0.332612 (50 iterations in 1.74 seconds)
#> Iteration 850: error is 0.328442 (50 iterations in 1.74 seconds)
#> Iteration 900: error is 0.325026 (50 iterations in 1.74 seconds)
#> Iteration 950: error is 0.322085 (50 iterations in 1.73 seconds)
#> Iteration 1000: error is 0.319664 (50 iterations in 1.72 seconds)
#> Fitting performed in 35.44 seconds.
plot_embedding(Y1)
plot_embedding(Y2)
# }
# Compute a 2-d embedding of the loadings using UMAP.
# \donttest{
Y1 <- umap_from_topics(fit1)
#> 09:22:32 UMAP embedding parameters a = 1.896 b = 0.8006
#> 09:22:32 Read 3774 rows and found 6 numeric columns
#> 09:22:32 Using FNN for neighbor search, n_neighbors = 30
#> 09:22:32 Commencing smooth kNN distance calibration using 4 threads
#> with target n_neighbors = 30
#> 09:22:32 Initializing from normalized Laplacian + noise (using irlba)
#> 09:22:33 Commencing optimization for 500 epochs, with 139662 positive edges
#> 09:22:37 Optimization finished
Y2 <- umap_from_topics(fit2)
#> 09:22:37 UMAP embedding parameters a = 1.896 b = 0.8006
#> 09:22:37 Read 3774 rows and found 6 numeric columns
#> 09:22:37 Using FNN for neighbor search, n_neighbors = 30
#> 09:22:38 Commencing smooth kNN distance calibration using 4 threads
#> with target n_neighbors = 30
#> 09:22:38 110 smooth knn distance failures
#> 09:22:38 Initializing from normalized Laplacian + noise (using irlba)
#> 09:22:38 Commencing optimization for 500 epochs, with 137990 positive edges
#> 09:22:43 Optimization finished
plot_embedding(Y1)
plot_embedding(Y2)
# }