This vignette demonstrates mvsusieR in the context of prediction. We first simulate three phenotypes (\(y_1\), \(y_2\), \(y_3\)) and genotypes at \(P = 1000\) loci (\(X\)) for \(N = 500\) individuals. Then, the goal is to predict phenotypes from genotypes for \(N_{test} = 100\) individuals, training our model on the remaining \(N_{training} = 400\) individuals.
library(mvsusieR) # Loading required package: mashr # Loading required package: ashr # Loading required package: susieR set.seed(1234) options(stringsAsFactors = FALSE)
The first step involves simulating phenotypic and genotypic data.
N = 500 P = 1000 true_eff <- 2 X = matrix(sample(c(0, 1, 2), size=N*P, replace=T), nrow=N, ncol=P) beta1 = beta2 = beta3 = rep(0, P) beta1[1:true_eff] = runif(true_eff) beta2[1:true_eff] = runif(true_eff) beta3[1:true_eff] = runif(true_eff) y1 = X %*% beta1 + rnorm(N) y2 = X %*% beta2 + rnorm(N) y3 = X %*% beta3 + rnorm(N)
The phenotypic data was simulated assuming the same 2 causal variables affect each of the traits with different effects.
which(beta1 != 0) # [1] 1 2 which(beta2 != 0) # [1] 1 2 which(beta3 != 0) # [1] 1 2 head(beta1) # [1] 0.9679794 0.1423120 0.0000000 0.0000000 0.0000000 0.0000000 head(beta2) # [1] 0.3893104 0.9137142 0.0000000 0.0000000 0.0000000 0.0000000 head(beta3) # [1] 0.3161626 0.9683988 0.0000000 0.0000000 0.0000000 0.0000000
For starters, we assume there are at most 10 causal variables (i.e., set L = 10
), select the first 100 individuals as the training set, and derive the prior from data:
L <- 10 test_num <- 1:100 y <- cbind(y1[-test_num], y2[-test_num], y3[-test_num]) prior_covar <- create_mash_prior(sample_data = list(X=X[-test_num, ],Y=y,residual_variance=cov(y)), max_mixture_len=-1)
Then, we fit the mvsusie model to the training set.
m <- mvsusie(X[-test_num, ], y, L=L, prior_variance=prior_covar) # Warning in mvsusie_core(data, s_init, L, prior_variance, prior_weights, : # precompute_covariances option is set to FALSE by default to save memory # usage with MASH prior. The computation can be a lot slower as a result. It is # recommended that you try setting it to TRUE, see if there is a memory usage # issue and only switch back if it is a problem. # Iteration 1 delta = Inf # Iteration 2 delta = 20.5309565923064 # Iteration 3 delta = 1.62692156138473 # Iteration 4 delta = 0.730218996353187 # Iteration 5 delta = 0.417400330157079 # Iteration 6 delta = 0.27236154822458 # Iteration 7 delta = 0.192774166812114 # Iteration 8 delta = 0.144124401491354 # Iteration 9 delta = 0.112074494497392 # Iteration 10 delta = 0.0897717674538399 # Iteration 11 delta = 0.0729216707716205 # Iteration 12 delta = 0.746616399812865
Finally, we use the parameter estimates from the training set to predict phenotypes in the test set. The accuracy of prediction is calculated as the correlation between true and predicted phenotypes.
Here are some details about the computing environment, including the versions of R, and the R packages, used to generate these results.
sessionInfo() # R version 3.6.3 (2020-02-29) # Platform: x86_64-pc-linux-gnu (64-bit) # Running under: Linux Mint 20 # # Matrix products: default # BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 # LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3 # # locale: # [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C # [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 # [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 # [7] LC_PAPER=en_US.UTF-8 LC_NAME=C # [9] LC_ADDRESS=C LC_TELEPHONE=C # [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C # # attached base packages: # [1] stats graphics grDevices utils datasets methods base # # other attached packages: # [1] mvsusieR_0.0.3.0436 susieR_0.10.1 mashr_0.2.43 # [4] ashr_2.2-51 # # loaded via a namespace (and not attached): # [1] progress_1.2.2 softImpute_1.4 tidyselect_1.1.0 xfun_0.13 # [5] purrr_0.3.4 reshape2_1.4.3 lattice_0.20-40 colorspace_2.0-0 # [9] vctrs_0.3.6 generics_0.1.0 htmltools_0.4.0 yaml_2.2.1 # [13] rlang_0.4.10 mixsqp_0.3-46 pkgdown_1.5.1 pillar_1.4.7 # [17] glue_1.4.2 matrixStats_0.58.0 lifecycle_1.0.0 plyr_1.8.6 # [21] stringr_1.4.0 munsell_0.5.0 gtable_0.3.0 mvtnorm_1.1-1 # [25] memoise_1.1.0 evaluate_0.14 knitr_1.28 invgamma_1.1 # [29] irlba_2.3.3 Rcpp_1.0.6 scales_1.1.1 rmeta_3.0 # [33] desc_1.2.0 truncnorm_1.0-8 abind_1.4-5 fs_1.3.2 # [37] hms_1.0.0 flashr_0.6-7 ggplot2_3.3.3 digest_0.6.27 # [41] stringi_1.5.3 dplyr_1.0.2 grid_3.6.3 rprojroot_2.0.2 # [45] tools_3.6.3 magrittr_2.0.1 tibble_3.0.5 crayon_1.4.1 # [49] pkgconfig_2.0.3 MASS_7.3-51.5 ellipsis_0.3.1 Matrix_1.2-18 # [53] prettyunits_1.1.1 SQUAREM_2021.1 reshape_0.8.8 assertthat_0.2.1 # [57] rmarkdown_2.3 R6_2.5.0 compiler_3.6.3