Let the observations be \(x=(x_1,\dots,x_n)\) and the latent variables be \(z=(z_1,\dots,z_m)\). We are interested in the posterior distribution \(p(z|x)\). We approximate it by a member of a family of distributions \(q\), chosen to minimize the KL-divergence:
\[ \begin{aligned} KL(q,p)&=\mathbb E_q\log\frac{q(z)}{p(z|x)}=\mathbb E_q\log q(z)-\mathbb E_q\log p(x,z)+\mathbb E_q\log p(x)\\ &=\log p(x)-[\mathbb E_q\log p(x,z)-\mathbb E_q\log q(z)]\\ &:=\log p(x)-ELBO(q), \end{aligned} \]
where we define the evidence lower bound
\[ ELBO(q)=\mathbb E_q\log p(x,z)-\mathbb E_q\log q(z). \]
As \(\log p(x)\) is constant with respect to \(q\), minimizing the KL-divergence is equivalent to maximizing \(ELBO(q)\) over \(q\).
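This identity is easy to verify numerically. Below is a small R check for a single binary latent variable; the prior and likelihood values are made up for illustration.

```r
# Check the identity KL(q, p) = log p(x) - ELBO(q) for one binary latent z.
prior <- c(0.7, 0.3)          # p(z = 0), p(z = 1)
lik   <- c(0.2, 0.7)          # p(x | z = 0), p(x | z = 1) at the observed x
joint <- prior * lik          # p(x, z)
px    <- sum(joint)           # the evidence p(x)
post  <- joint / px           # the posterior p(z | x)

q <- c(0.5, 0.5)              # an arbitrary variational distribution

kl   <- sum(q * log(q / post))
elbo <- sum(q * (log(joint) - log(q)))
all.equal(kl, log(px) - elbo) # TRUE
```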
Now we focus on the following setting: assume the family \(q\) factorizes as
\[ q(z_1,\dots,z_m)=\prod_{j=1}^mq(z_j), \] i.e. the latent variables are independent under \(q\) (the mean-field assumption).
We introduce a coordinate ascent inference framework to maximize the ELBO.
We have \[ p(z_{1:m},x)=p(x)\cdot\prod_{j=1}^m p(z_j|z_{1:(j-1)},x). \] Note that the ordering of the latent variables is arbitrary: we may relabel the indices of \(z\) freely.
Likewise, by the factorization of \(q\),
\[ \mathbb E_q \log q(z_{1:m})=\sum_{j=1}^m\mathbb E_{q_j}\log q(z_j). \]
Therefore,
\[ \begin{aligned} ELBO(q)&=\mathbb E_q\log p(x,z)-\mathbb E_q\log q(z)\\ &=\mathbb E_q\log\left( p(x)\cdot\prod_{j=1}^m p(z_j|z_{1:(j-1)},x)\right)-\sum_{j=1}^m\mathbb E_{q_j}\log q(z_j)\\ &=\log p(x)+\sum_{j=1}^m \mathbb E_q\log p(z_j|z_{1:(j-1)},x)-\sum_{j=1}^m\mathbb E_{q_j}\log q(z_j). \end{aligned} \] Since the ordering of the indices of \(z\) is arbitrary, we may swap them freely. Specifically, to investigate \(q(z_k)\), \(1\leq k\leq m\), we can swap \(z_k\) and \(z_m\) so that \(z_k\) appears in only one term of \(\sum_j\mathbb E_q\log p(z_j|z_{1:(j-1)},x)\), exactly as if \(k=m\). Then the only term containing \(z_k\) is \(\mathbb E_q\log p(z_k|z_{-k},x)\).
To maximize \(ELBO(q)\) with respect to \(z_k\) alone, i.e. holding the other factors \(q(z_{-k})\) fixed, we have \[ ELBO(q(z_k))=\mathbb E_q\log p(z_k|z_{-k},x)-\mathbb E_{q_k}\log q(z_k)+const. \]
Taking into account the normalization constraint
\[ \int q(z_k)dz_k=1, \]
the Lagrangian is
\[ \mathcal L_k=\int q(z_k)\mathbb E_{q(z_{-k})}\log p(z_k|z_{-k},x)dz_k-\int q(z_k)\log q(z_k)dz_k-\lambda\left(\int q(z_k)dz_k-1\right). \]
We then take the derivative with respect to \(q(z_k)\) (think of \(q(z_k)\) as a vector, albeit an infinite-dimensional one):
\[ \frac{d \mathcal L_k}{dq(z_k)}=\mathbb E_{q(z_{-k})}\log p(z_k|z_{-k},x)-\log q(z_k)-1-\lambda. \] Setting this derivative to zero gives \[ q(z_k)=\exp\{\mathbb E_{q(z_{-k})}\log p(z_k|z_{-k},x)\}\cdot\exp\{-(1+\lambda)\}. \]
Hence we obtain the coordinate ascent updating step
\[ q^\star(z_k)\propto \exp\{\mathbb E_{q(z_{-k})}\log p(z_k|z_{-k},x)\}\propto\exp\{\mathbb E_{q(z_{-k})}\log p(z_k,z_{-k},x)\}. \]
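In code, this update yields a simple nested loop. Below is a minimal R sketch for latent variables supported on a finite grid; `expected_log_joint` is a hypothetical, model-specific helper that returns \(\mathbb E_{q(z_{-k})}\log p(z_k,z_{-k},x)\) evaluated at each grid point of \(z_k\).

```r
# Schematic CAVI loop: repeatedly set q*(z_k) prop. to
# exp{E_{q(z_-k)} log p(z_k, z_-k, x)} for each coordinate k.
# `q` is a list of per-coordinate probability vectors over a grid;
# `expected_log_joint(k, q)` is an assumed model-specific helper.
cavi <- function(q, expected_log_joint, n_iter = 100) {
  for (iter in seq_len(n_iter)) {
    for (k in seq_along(q)) {
      log_q  <- expected_log_joint(k, q)
      log_q  <- log_q - max(log_q)            # stabilize before exponentiating
      q[[k]] <- exp(log_q) / sum(exp(log_q))  # normalize to a distribution
    }
  }
  q
}
```

Each inner update can only increase the ELBO, which is what makes this a coordinate ascent.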
We now apply this to the ridge regression model \(\textbf y=\textbf X\beta+\varepsilon\), with prior \(\beta_i\sim N(0,\sigma^2\sigma_b^2)\) and noise \(\varepsilon\sim N(0,\sigma^2 I)\). We use the mean-field assumption \(q(\beta_1,\dots,\beta_p)=\prod_{i=1}^p q(\beta_i)\), and we take each factor of \(q\) to be Gaussian, i.e. \(\beta_i\sim N(\mu_{1i},\sigma^2_{1i})\) under \(q\).
Assume \(\sigma^2,\sigma^2_b\) are known. For a specific \(1\leq k\leq p\), the updating step is
\[ q^\star(\beta_k)\propto\exp\{\mathbb E_{q(\beta_{-k})}\log p(\beta_k|\beta_{-k},\textbf X,\textbf y)\}. \]
Let \(\textbf r_k=\textbf y-\textbf X_{-k}\beta_{-k}\). Under \(q_{-k}\) we have \(\textbf r_k\sim N\left(\textbf y-\textbf X_{-k}\mu_{-k},\textbf X_{-k}\textbf D_{-k}\textbf X_{-k}^T\right)\), where \(\textbf D_{-k}=diag\left\{\sigma_{11}^2,\dots,\sigma^2_{1(k-1)},\sigma^2_{1(k+1)},\dots,\sigma_{1p}^2\right\}\). For simplicity, write \(\bar {\textbf r}_k=\textbf y-\textbf X_{-k}\mu_{-k}\).
Since \[ p(\beta_k|\beta_{-k},\textbf y,\textbf X)\propto p(\beta_k)\cdot L(\beta_k;\beta_{-k},\textbf X,\textbf y), \]
we have
\[ \begin{aligned} \log p(\beta_k|\beta_{-k},\textbf X,\textbf y)&=const+\log N(\beta_k;0,\sigma^2\sigma_b^2)+\log N(\textbf r_k-\textbf X_k\beta_k;0,\sigma^2I)\\ &=const-\frac{\beta_k^2}{2\sigma^2\sigma^2_b}-\frac1{2\sigma^2}\|\textbf r_k-\textbf X_k\beta_k\|^2_2. \end{aligned} \]
Under \(q_{-k}\), we have
\[ \textbf r_k-\textbf X_k\beta_k\sim N(\textbf y-\textbf X_{-k}\mu_{-k}-\textbf X_k\beta_k,\textbf X_{-k}\textbf D_{-k}\textbf X_{-k}^T) \]
Since \(\textbf X_{-k}\textbf D_{-k}\textbf X_{-k}^T\) is symmetric positive semi-definite, it admits the spectral decomposition
\[ \textbf X_{-k}\textbf D_{-k}\textbf X_{-k}^T=\textbf P^T\Lambda \textbf P, \] where \(\textbf P\) is orthogonal and \(\Lambda\) is diagonal with nonnegative entries. Then
\[ \|\textbf r_k-\textbf X_k\beta_k\|^2_2=\|\textbf P(\textbf r_k-\textbf X_k\beta_k)\|^2_2. \] And
\[ \textbf P(\textbf r_k-\textbf X_k\beta_k)\sim N(\textbf P(\bar {\textbf r}_k-\textbf X_k\beta_k),\Lambda). \]
Thus
\[ \mathbb E\|\textbf P(\textbf r_k-\textbf X_k\beta_k)\|^2_2=trace(\Lambda)+\|P(\bar{\textbf r}_k-\textbf X_k\beta_k)\|_2^2=trace(\Lambda)+\|\bar {\textbf r}_k-\textbf X_k\beta_k\|_2^2. \]
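This is the standard identity \(\mathbb E\|\textbf w\|_2^2=trace(\Sigma)+\|\mathbb E\textbf w\|_2^2\) for \(\textbf w\sim N(\textbf m,\Sigma)\), applied here with \(\Sigma=\Lambda\). A quick Monte Carlo check in R, with an arbitrary mean and covariance chosen for illustration:

```r
# Verify E||w||^2 = trace(Sigma) + ||m||^2 for w ~ N(m, Sigma).
set.seed(1)
d     <- 3
m     <- c(1, -2, 0.5)
A     <- matrix(rnorm(d * d), d, d)
Sigma <- crossprod(A)              # an arbitrary covariance matrix
L     <- t(chol(Sigma))            # Sigma = L %*% t(L)

n <- 1e5
W <- m + L %*% matrix(rnorm(d * n), d, n)  # one draw per column
mean(colSums(W^2))                 # Monte Carlo estimate
sum(diag(Sigma)) + sum(m^2)        # exact value
```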
Therefore
\[ \begin{aligned} q^\star(\beta_k)\propto&\exp\{\mathbb E_{q(\beta_{-k})}\log p(\beta_k|\beta_{-k},\textbf y,\textbf X)\}\\ =&\exp\left\{const-\frac{\beta_k^2}{2\sigma^2\sigma^2_b}-\frac1{2\sigma^2}\mathbb E\|\textbf r_k-\textbf X_k\beta_k\|^2_2\right\}\\ =&\exp\left\{const-\frac{\beta_k^2}{2\sigma^2\sigma^2_b}-\frac1{2\sigma^2}\left(trace(\Lambda)+\|\bar {\textbf r}_k-\textbf X_k\beta_k\|_2^2\right)\right\}\\ \propto &\exp\left\{-\frac1{2\sigma^2}\left[(1/\sigma_b^2+\|\textbf X_k\|_2^2)\beta_k^2-2\bar {\textbf r}_k^T\textbf X_k\beta_k\right]\right\}\\ \propto &\exp\left\{-\frac1{2\sigma^2/(1/\sigma_b^2+\|\textbf X_k\|_2^2)}\left(\beta_k-\frac{\bar {\textbf r}_k^T\textbf X_k}{1/\sigma_b^2+\|\textbf X_k\|_2^2}\right)^2\right\}. \end{aligned} \] Thus the updated factor \(q^\star (\beta_k)\) is Gaussian:
\[ \beta_k\sim N\left(\frac{\bar {\textbf r}_k^T\textbf X_k}{1/\sigma_b^2+\|\textbf X_k\|_2^2},\frac{\sigma^2}{1/\sigma_b^2+\|\textbf X_k\|_2^2}\right). \] That is, the updating step sets \(\hat\mu_{1k}=\frac{\bar {\textbf r}_k^T\textbf X_k}{1/\sigma_b^2+\|\textbf X_k\|_2^2}\) and \(\hat\sigma_{1k}^2=\frac{\sigma^2}{1/\sigma_b^2+\|\textbf X_k\|_2^2}\). Note that \(\hat\sigma_{1k}^2\) does not depend on the other coordinates, so it need only be computed once rather than updated at every iteration.
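Putting the updates together, here is a minimal R sketch of the full coordinate ascent for this model (function and variable names are our own; \(\sigma^2\) and \(\sigma_b^2\) are assumed known, as above):

```r
# CAVI for y = X beta + e, beta_k ~ N(0, sigma2 * sigma2_b), e ~ N(0, sigma2 * I).
# Returns the variational means mu_{1k} and variances sigma2_{1k}.
cavi_ridge <- function(X, y, sigma2, sigma2_b, n_iter = 100) {
  p   <- ncol(X)
  mu  <- rep(0, p)                      # variational means mu_{1k}
  xtx <- colSums(X^2)                   # ||X_k||_2^2 for each k
  s2  <- sigma2 / (1 / sigma2_b + xtx)  # sigma2_{1k}: constant, computed once
  for (iter in seq_len(n_iter)) {
    for (k in seq_len(p)) {
      rbar  <- y - X[, -k, drop = FALSE] %*% mu[-k]  # bar r_k
      mu[k] <- sum(rbar * X[, k]) / (1 / sigma2_b + xtx[k])
    }
  }
  list(mu = mu, sigma2_1 = s2)
}

# Example on simulated data:
set.seed(2)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)
fit <- cavi_ridge(X, y, sigma2 = 1, sigma2_b = 1)
fit$mu
```

A fixed number of sweeps is used here for simplicity; in practice one would monitor the ELBO, or the change in \(\mu\) between sweeps, to decide convergence.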