Visualize the structure of the Poisson NMF loadings or the multinomial topic model topic proportions by projecting them onto a 2-d surface. pca_hexbin_plot is most useful for visualizing the PCs of a data set with thousands of samples or more.

embedding_plot_2d(
  fit,
  Y,
  fill = "loading",
  k,
  fill.label,
  ggplot_call = embedding_plot_2d_ggplot_call,
  plot_grid_call = function(plots) do.call(plot_grid, plots)
)

embedding_plot_2d_ggplot_call(
  Y,
  fill,
  fill.type = c("loading", "numeric", "factor", "none"),
  fill.label,
  font.size = 9
)

pca_plot(
  fit,
  Y,
  pcs = 1:2,
  n = 10000,
  fill = "loading",
  k,
  fill.label,
  ggplot_call = embedding_plot_2d_ggplot_call,
  plot_grid_call = function(plots) do.call(plot_grid, plots),
  ...
)

tsne_plot(
  fit,
  Y,
  n = 2000,
  fill = "loading",
  k,
  fill.label,
  ggplot_call = embedding_plot_2d_ggplot_call,
  plot_grid_call = function(plots) do.call(plot_grid, plots),
  ...
)

umap_plot(
  fit,
  Y,
  n = 2000,
  fill = "loading",
  k,
  fill.label,
  ggplot_call = embedding_plot_2d_ggplot_call,
  plot_grid_call = function(plots) do.call(plot_grid, plots),
  ...
)

pca_hexbin_plot(
  fit,
  Y,
  pcs = 1:2,
  bins = 40,
  breaks = c(0, 1, 10, 100, 1000, Inf),
  ggplot_call = pca_hexbin_plot_ggplot_call,
  ...
)

pca_hexbin_plot_ggplot_call(Y, bins, breaks, font.size = 9)

Arguments

fit

An object of class “poisson_nmf_fit” or “multinom_topic_model_fit”.

Y

The n x 2 matrix containing the 2-d embedding, where n is the number of rows in fit$L. If not provided, the embedding will be computed automatically.

fill

The quantity to map onto the fill colour of the points. Set fill = "loading" to vary the fill colour according to the loadings (or topic proportions) of the selected topic or topics. Alternatively, fill may be set to a data vector with one entry per row of fit$L, in which case these data are mapped to the fill colour of the points. When fill = "none", the fill colour is not varied.

k

The dimensions or topics selected by number or name. When fill = "loading", one plot is created per selected dimension or topic; when fill = "loading" and k is not specified, all dimensions or topics are plotted.

fill.label

The label used for the fill colour legend.

ggplot_call

The function used to create the plot. Replace embedding_plot_2d_ggplot_call or pca_hexbin_plot_ggplot_call with your own function to customize the appearance of the plot.
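For illustration, a minimal custom ggplot_call might look as follows. This is a sketch, not the package's implementation: the argument names follow the signature of embedding_plot_2d_ggplot_call shown in the Usage section, and for simplicity it handles only the fill.type = "loading" case.

```r
library(ggplot2)
library(cowplot)

# Sketch of a replacement for embedding_plot_2d_ggplot_call; assumes
# fill is a numeric vector (the fill.type = "loading" case).
my_embedding_plot <- function (Y, fill, fill.type, fill.label,
                               font.size = 9) {
  pdat <- data.frame(d1 = Y[,1], d2 = Y[,2], y = fill)
  ggplot(pdat, aes(x = d1, y = d2, fill = y)) +
    geom_point(shape = 21, color = "white") +
    labs(fill = fill.label) +
    theme_cowplot(font_size = font.size)
}

# Usage: pca_plot(fit, ggplot_call = my_embedding_plot)
```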

plot_grid_call

When fill = "loading" and multiple topics (k) are selected, this is the function used to arrange the plots into a grid using plot_grid. It should be a function accepting a single argument, plots, a list of ggplot objects.
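For example, to arrange all per-topic panels in a single row rather than the default grid (a sketch, assuming cowplot is loaded):

```r
library(cowplot)

# Arrange the list of per-topic ggplot panels in one row.
one_row <- function (plots)
  do.call(plot_grid, c(plots, list(nrow = 1)))

# Usage: pca_plot(fit, k = 1:3, plot_grid_call = one_row)
```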

fill.type

The type of variable mapped to fill colour. The fill colour is not varied when fill.type = "none".

font.size

Font size used in the plot.

pcs

The two principal components (PCs) to be plotted, specified by name or number.

n

The maximum number of points to plot. If n is less than the number of rows of fit$L, the rows are subsampled at random. This argument is ignored if Y is provided.

...

Additional arguments passed to pca_from_topics, tsne_from_topics or umap_from_topics. These additional arguments are only used if Y is not provided.

bins

Number of bins used to create hexagonal 2-d histogram. Passed as the “bins” argument to stat_bin_hex.

breaks

To produce the hexagonal histogram, the counts are subdivided into intervals based on breaks. Passed as the “breaks” argument to cut.

Value

A ggplot object.

Details

This is a lightweight interface primarily intended to expedite the creation of plots for visualizing the loadings or topic proportions; most of the heavy lifting is done by ‘ggplot2’. The 2-d embedding itself is computed by invoking pca_from_topics, tsne_from_topics or umap_from_topics. For more control over the plots' appearance, customize them via the ggplot_call and plot_grid_call arguments.

An effective 2-d visualization may also require some fine-tuning of the settings, such as the t-SNE “perplexity”, or the number of samples included in the plot. The PCA, t-SNE and UMAP settings can be controlled via the additional arguments (...). Alternatively, a 2-d embedding may be computed in advance and passed as argument Y.
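Both approaches might be used together, for example (a sketch; this assumes tsne_from_topics forwards a "perplexity" setting to the t-SNE engine, as suggested by the console output in the Examples below):

```r
# Tune t-SNE via the additional arguments (...):
p0 <- tsne_plot(fit, perplexity = 50)

# Or precompute the 2-d embedding once, then reuse it across plots so
# that t-SNE is not rerun each time:
Y  <- tsne_from_topics(fit)
p1 <- tsne_plot(fit, Y, k = 1)
p2 <- tsne_plot(fit, Y, fill = "none")
```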

Examples

set.seed(1)
data(pbmc_facs)

# Get the Poisson NMF and multinomial topic models fitted to the
# PBMC data.
fit1 <- multinom2poisson(pbmc_facs$fit)
fit2 <- pbmc_facs$fit

# Plot the first two PCs of the loadings matrix (for the
# multinomial topic model, "fit2", the loadings are the topic
# proportions).
subpop <- pbmc_facs$samples$subpop
p1 <- pca_plot(fit1,k = 1)
p2 <- pca_plot(fit2)
p3 <- pca_plot(fit2,fill = "none")
p4 <- pca_plot(fit2,pcs = 3:4,fill = "none")
p5 <- pca_plot(fit2,fill = fit2$L[,1])
p6 <- pca_plot(fit2,fill = subpop)
p7 <- pca_hexbin_plot(fit1)
p8 <- pca_hexbin_plot(fit2)

# \donttest{
# Plot the loadings using t-SNE.
p1 <- tsne_plot(fit1,k = 1)
#> Read the 2000 x 6 data matrix successfully!
#> Using no_dims = 2, perplexity = 100.000000, and theta = 0.100000
#> Computing input similarities...
#> Building tree...
#> Done in 0.50 seconds (sparsity = 0.195510)!
#> Learning embedding...
#> Iteration 50: error is 56.566020 (50 iterations in 1.02 seconds)
#> Iteration 100: error is 49.586311 (50 iterations in 0.70 seconds)
#> Iteration 150: error is 48.779634 (50 iterations in 0.68 seconds)
#> Iteration 200: error is 48.481377 (50 iterations in 0.67 seconds)
#> Iteration 250: error is 48.321772 (50 iterations in 0.68 seconds)
#> Iteration 300: error is 0.499560 (50 iterations in 0.81 seconds)
#> Iteration 350: error is 0.354588 (50 iterations in 0.83 seconds)
#> Iteration 400: error is 0.303966 (50 iterations in 0.81 seconds)
#> Iteration 450: error is 0.280049 (50 iterations in 0.81 seconds)
#> Iteration 500: error is 0.266749 (50 iterations in 0.81 seconds)
#> Iteration 550: error is 0.258490 (50 iterations in 0.80 seconds)
#> Iteration 600: error is 0.252933 (50 iterations in 0.80 seconds)
#> Iteration 650: error is 0.248995 (50 iterations in 0.79 seconds)
#> Iteration 700: error is 0.246062 (50 iterations in 0.78 seconds)
#> Iteration 750: error is 0.243852 (50 iterations in 0.78 seconds)
#> Iteration 800: error is 0.242078 (50 iterations in 0.78 seconds)
#> Iteration 850: error is 0.240709 (50 iterations in 0.78 seconds)
#> Iteration 900: error is 0.239629 (50 iterations in 0.77 seconds)
#> Iteration 950: error is 0.238710 (50 iterations in 0.76 seconds)
#> Iteration 1000: error is 0.237995 (50 iterations in 0.76 seconds)
#> Fitting performed in 15.62 seconds.
p2 <- tsne_plot(fit2)
#> Read the 2000 x 6 data matrix successfully!
#> Using no_dims = 2, perplexity = 100.000000, and theta = 0.100000
#> Computing input similarities...
#> Building tree...
#> Done in 0.48 seconds (sparsity = 0.185092)!
#> Learning embedding...
#> Iteration 50: error is 55.169629 (50 iterations in 0.96 seconds)
#> Iteration 100: error is 48.296393 (50 iterations in 0.68 seconds)
#> Iteration 150: error is 47.207207 (50 iterations in 0.62 seconds)
#> Iteration 200: error is 46.770503 (50 iterations in 0.61 seconds)
#> Iteration 250: error is 46.531120 (50 iterations in 0.61 seconds)
#> Iteration 300: error is 0.483807 (50 iterations in 0.74 seconds)
#> Iteration 350: error is 0.337533 (50 iterations in 0.74 seconds)
#> Iteration 400: error is 0.280061 (50 iterations in 0.74 seconds)
#> Iteration 450: error is 0.251434 (50 iterations in 0.73 seconds)
#> Iteration 500: error is 0.235569 (50 iterations in 0.74 seconds)
#> Iteration 550: error is 0.226006 (50 iterations in 0.72 seconds)
#> Iteration 600: error is 0.219770 (50 iterations in 0.70 seconds)
#> Iteration 650: error is 0.215321 (50 iterations in 0.71 seconds)
#> Iteration 700: error is 0.212055 (50 iterations in 0.70 seconds)
#> Iteration 750: error is 0.209506 (50 iterations in 0.70 seconds)
#> Iteration 800: error is 0.207451 (50 iterations in 0.69 seconds)
#> Iteration 850: error is 0.205801 (50 iterations in 0.69 seconds)
#> Iteration 900: error is 0.204450 (50 iterations in 0.74 seconds)
#> Iteration 950: error is 0.203271 (50 iterations in 0.86 seconds)
#> Iteration 1000: error is 0.202383 (50 iterations in 0.75 seconds)
#> Fitting performed in 14.43 seconds.
p3 <- tsne_plot(fit2,fill = subpop)
#> Read the 2000 x 6 data matrix successfully!
#> Using no_dims = 2, perplexity = 100.000000, and theta = 0.100000
#> Computing input similarities...
#> Building tree...
#> Done in 0.49 seconds (sparsity = 0.184268)!
#> Learning embedding...
#> Iteration 50: error is 54.034875 (50 iterations in 1.06 seconds)
#> Iteration 100: error is 47.811985 (50 iterations in 0.73 seconds)
#> Iteration 150: error is 47.002541 (50 iterations in 0.72 seconds)
#> Iteration 200: error is 46.676555 (50 iterations in 0.72 seconds)
#> Iteration 250: error is 46.488674 (50 iterations in 0.71 seconds)
#> Iteration 300: error is 0.461843 (50 iterations in 0.75 seconds)
#> Iteration 350: error is 0.320256 (50 iterations in 0.76 seconds)
#> Iteration 400: error is 0.267543 (50 iterations in 0.77 seconds)
#> Iteration 450: error is 0.241771 (50 iterations in 0.76 seconds)
#> Iteration 500: error is 0.227141 (50 iterations in 0.77 seconds)
#> Iteration 550: error is 0.217905 (50 iterations in 0.76 seconds)
#> Iteration 600: error is 0.211658 (50 iterations in 0.76 seconds)
#> Iteration 650: error is 0.207181 (50 iterations in 0.77 seconds)
#> Iteration 700: error is 0.203827 (50 iterations in 0.78 seconds)
#> Iteration 750: error is 0.201253 (50 iterations in 0.78 seconds)
#> Iteration 800: error is 0.199169 (50 iterations in 0.79 seconds)
#> Iteration 850: error is 0.197529 (50 iterations in 0.79 seconds)
#> Iteration 900: error is 0.196210 (50 iterations in 0.78 seconds)
#> Iteration 950: error is 0.195116 (50 iterations in 0.78 seconds)
#> Iteration 1000: error is 0.194157 (50 iterations in 0.79 seconds)
#> Fitting performed in 15.54 seconds.

# Plot the loadings using UMAP.
p1 <- umap_plot(fit1,k = 1)
#> 09:21:01 UMAP embedding parameters a = 1.896 b = 0.8006
#> 09:21:01 Read 2000 rows and found 6 numeric columns
#> 09:21:01 Using FNN for neighbor search, n_neighbors = 30
#> 09:21:01 Commencing smooth kNN distance calibration using 4 threads
#>  with target n_neighbors = 30
#> 09:21:02 Initializing from normalized Laplacian + noise (using irlba)
#> 09:21:02 Commencing optimization for 500 epochs, with 74134 positive edges
#> 09:21:04 Optimization finished
p2 <- umap_plot(fit2)
#> 09:21:04 UMAP embedding parameters a = 1.896 b = 0.8006
#> 09:21:04 Read 2000 rows and found 6 numeric columns
#> 09:21:04 Using FNN for neighbor search, n_neighbors = 30
#> 09:21:05 Commencing smooth kNN distance calibration using 4 threads
#>  with target n_neighbors = 30
#> 09:21:05 56 smooth knn distance failures
#> 09:21:05 Found 3 connected components, 
#> falling back to 'spca' initialization with init_sdev = 1
#> 09:21:05 Using 'irlba' for PCA
#> 09:21:05 PCA: 2 components explained 65.39% variance
#> 09:21:05 Scaling init to sdev = 1
#> 09:21:05 Commencing optimization for 500 epochs, with 72844 positive edges
#> 09:21:08 Optimization finished
p3 <- umap_plot(fit2,fill = subpop)
#> 09:21:08 UMAP embedding parameters a = 1.896 b = 0.8006
#> 09:21:08 Read 2000 rows and found 6 numeric columns
#> 09:21:08 Using FNN for neighbor search, n_neighbors = 30
#> 09:21:08 Commencing smooth kNN distance calibration using 4 threads
#>  with target n_neighbors = 30
#> 09:21:08 54 smooth knn distance failures
#> 09:21:08 Found 2 connected components, 
#> falling back to 'spca' initialization with init_sdev = 1
#> 09:21:08 Using 'irlba' for PCA
#> 09:21:08 PCA: 2 components explained 65.4% variance
#> 09:21:08 Scaling init to sdev = 1
#> 09:21:08 Commencing optimization for 500 epochs, with 72822 positive edges
#> 09:21:11 Optimization finished
# }