Prediction example

This vignette demonstrates mvsusieR in the context of prediction. We first simulate three phenotypes (\(y_1\), \(y_2\), \(y_3\)) and genotypes at \(P = 1000\) loci (\(X\)) for \(N = 500\) individuals. Then, the goal is to predict phenotypes from genotypes for \(N_{test} = 100\) individuals, training our model on the remaining \(N_{training} = 400\) individuals.

library(mvsusieR)
# Loading required package: mashr
# Loading required package: ashr
# Loading required package: susieR
set.seed(1234)
options(stringsAsFactors = FALSE)

The data-set

The first step involves simulating phenotypic and genotypic data.

N = 500
P = 1000
true_eff <- 2
X = matrix(sample(c(0, 1, 2), size=N*P, replace=T), nrow=N, ncol=P)
beta1 = beta2 = beta3 = rep(0, P)
beta1[1:true_eff] = runif(true_eff)
beta2[1:true_eff] = runif(true_eff)
beta3[1:true_eff] = runif(true_eff)
y1 = X %*% beta1 + rnorm(N)
y2 = X %*% beta2 + rnorm(N)
y3 = X %*% beta3 + rnorm(N)

The phenotypic data was simulated assuming the same 2 causal variables affect each of the traits with different effects.

which(beta1 != 0)
# [1] 1 2
which(beta2 != 0)
# [1] 1 2
which(beta3 != 0)
# [1] 1 2

head(beta1)
# [1] 0.9679794 0.1423120 0.0000000 0.0000000 0.0000000 0.0000000
head(beta2)
# [1] 0.3893104 0.9137142 0.0000000 0.0000000 0.0000000 0.0000000
head(beta3)
# [1] 0.3161626 0.9683988 0.0000000 0.0000000 0.0000000 0.0000000

Prediction with mvsusieR

For starters, we assume there are at most 10 causal variables (i.e., set L = 10), select the first 100 individuals as the training set, and derive the prior from data:

L <- 10
test_num <- 1:100
y <- cbind(y1[-test_num], y2[-test_num], y3[-test_num])
prior_covar <- create_mash_prior(sample_data = list(X=X[-test_num, ],Y=y,residual_variance=cov(y)), max_mixture_len=-1)

Then, we fit the mvsusie model to the training set.

m <- mvsusie(X[-test_num, ], y, L=L, prior_variance=prior_covar)
# Warning in mvsusie_core(data, s_init, L, prior_variance, prior_weights, :
# precompute_covariances option is set to FALSE by default to save memory
# usage with MASH prior. The computation can be a lot slower as a result. It is
# recommended that you try setting it to TRUE, see if there is a memory usage
# issue and only switch back if it is a problem.
# Iteration 1 delta = Inf
# Iteration 2 delta = 20.5309565923064
# Iteration 3 delta = 1.62692156138473
# Iteration 4 delta = 0.730218996353187
# Iteration 5 delta = 0.417400330157079
# Iteration 6 delta = 0.27236154822458
# Iteration 7 delta = 0.192774166812114
# Iteration 8 delta = 0.144124401491354
# Iteration 9 delta = 0.112074494497392
# Iteration 10 delta = 0.0897717674538399
# Iteration 11 delta = 0.0729216707716205
# Iteration 12 delta = 0.746616399812865

Finally, we use the parameter estimates from the training set to predict phenotypes in the test set. The accuracy of prediction is calculated as the correlation between true and predicted phenotypes.

pred <- predict.mvsusie(m, X[test_num, ])

results <- matrix(c(cor(pred[, 1], y1[test_num, ]), cor(pred[, 2], y2[test_num, ]), cor(pred[, 3], y3[test_num, ])), nrow=3, ncol=1)
results
#           [,1]
# [1,] 0.6095684
# [2,] 0.5205873
# [3,] 0.7057224

Session information

Here are some details about the computing environment, including the versions of R, and the R packages, used to generate these results.

sessionInfo()
# R version 3.6.3 (2020-02-29)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Linux Mint 20
# 
# Matrix products: default
# BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
# LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] mvsusieR_0.0.3.0436 susieR_0.10.1       mashr_0.2.43       
# [4] ashr_2.2-51        
# 
# loaded via a namespace (and not attached):
#  [1] progress_1.2.2     softImpute_1.4     tidyselect_1.1.0   xfun_0.13         
#  [5] purrr_0.3.4        reshape2_1.4.3     lattice_0.20-40    colorspace_2.0-0  
#  [9] vctrs_0.3.6        generics_0.1.0     htmltools_0.4.0    yaml_2.2.1        
# [13] rlang_0.4.10       mixsqp_0.3-46      pkgdown_1.5.1      pillar_1.4.7      
# [17] glue_1.4.2         matrixStats_0.58.0 lifecycle_1.0.0    plyr_1.8.6        
# [21] stringr_1.4.0      munsell_0.5.0      gtable_0.3.0       mvtnorm_1.1-1     
# [25] memoise_1.1.0      evaluate_0.14      knitr_1.28         invgamma_1.1      
# [29] irlba_2.3.3        Rcpp_1.0.6         scales_1.1.1       rmeta_3.0         
# [33] desc_1.2.0         truncnorm_1.0-8    abind_1.4-5        fs_1.3.2          
# [37] hms_1.0.0          flashr_0.6-7       ggplot2_3.3.3      digest_0.6.27     
# [41] stringi_1.5.3      dplyr_1.0.2        grid_3.6.3         rprojroot_2.0.2   
# [45] tools_3.6.3        magrittr_2.0.1     tibble_3.0.5       crayon_1.4.1      
# [49] pkgconfig_2.0.3    MASS_7.3-51.5      ellipsis_0.3.1     Matrix_1.2-18     
# [53] prettyunits_1.1.1  SQUAREM_2021.1     reshape_0.8.8      assertthat_0.2.1  
# [57] rmarkdown_2.3      R6_2.5.0           compiler_3.6.3

Fabio Morgante

2021-04-09

The data-set

Prediction with mvsusieR

Session information