Bayesian Reduced Rank Regression in Econometrics
Caleb Jin / 2018-05-09
Foreword
After reading the paper Bayesian Reduced Rank Regression in Econometrics (Geweke 1996), I write down notes on its key ideas and R code to implement it.
Introduction
In contemporary society, large amounts of data are generated and collected routinely in various academic and industrial areas, such as engineering, politics, B2C e-commerce, genomics, etc. Many problems can be cast as statistical problems under the framework of the multivariate linear regression model, in which both the response variables and the predictors are high-dimensional.
We assume \(n\) independent observations of the response \({\bf y}_i\in \mathcal{R}^q\) with predictor vector \({\bf x}_i\in\mathcal{R}^p\), \(i=1,2,\ldots,n\). Consider the multivariate linear regression model as follows: \[\begin{eqnarray} {\bf Y}={\bf X}{\bf C}+ {\bf E}, \tag{1} \end{eqnarray}\] where \({\bf Y}=({\bf y}_1,{\bf y}_2,\ldots,{\bf y}_n)^{{\top}}\in\mathcal{R}^{n\times q}\) is the response matrix, \({\bf X}=({\bf x}_1,{\bf x}_2,\ldots,{\bf x}_n)^{{\top}}\in\mathcal{R}^{n\times p}\) is the design matrix, \({\bf C}\in\mathcal{R}^{p\times q}\) is the coefficient matrix, and \({\bf E}=({\bf e}_1,{\bf e}_2,\ldots,{\bf e}_n)^{{\top}}\in\mathcal{R}^{n\times q}\) is the disturbance matrix with \({\bf e}_i\)’s \(\overset{iid}{\sim}\mathcal{N}_q({\bf 0},{\boldsymbol \Sigma}_e)\). We assume \({\boldsymbol \Sigma}_e=\sigma^2{\bf I}_q\). Therefore, we have \({\bf Y}\sim\mathcal{MN}({\bf X}{\bf C},{\boldsymbol \Sigma}_e,{\bf I}_n)\).
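As a quick illustration, here is a minimal R sketch that simulates data from model (1) with \({\boldsymbol \Sigma}_e=\sigma^2{\bf I}_q\); the dimensions mirror the simulation study below, but the low-rank \({\bf C}\) here is a random illustrative choice rather than the specific matrix used there.

library(mvtnorm)
# Illustrative sizes; C is a random rank-r matrix for demonstration only
n <- 100; p <- 7; q <- 12; r <- 3; sig2 <- 2
A <- rbind(diag(1, r), matrix(rnorm((p - r) * r), p - r, r))  # p x r with top block I_r
B <- matrix(rnorm(q * r), q, r)                               # q x r
C <- A %*% t(B)                                               # p x q coefficient matrix of rank r
X <- matrix(rnorm(n * p), n, p)                               # design matrix
E <- rmvnorm(n, sigma = diag(sig2, q))                        # rows iid N_q(0, sigma^2 I_q)
Y <- X %*% C + E                                              # response matrix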
Bayesian Reduced Rank Regression Model
We consider decomposing the coefficient matrix into two parts as follows: \[\begin{eqnarray} {\bf C}={\bf A}{\bf B}^{{\top}}, \end{eqnarray}\] where \({\bf A}\in\mathcal{R}^{p\times r}\), \({\bf B}\in\mathcal{R}^{q\times r}\), and the rank \(r\leq \min(p,q)\) is assumed known. However, this decomposition is not unique, because for any \(r\times r\) nonsingular matrix \({\bf Q}\), \({\bf C}={\bf A}{\bf B}^{{\top}}={\bf A}{\bf Q}{\bf Q}^{-1}{\bf B}^{{\top}}=\tilde{{\bf A}}\tilde{{\bf B}}^{{\top}}.\) In order to identify the decomposition, we further restrict \({\bf A}\) by writing \({\bf A}^{{\top}}=[{\bf I}_r, {{\bf A}^*}^{{\top}}]\), so that the top \(r\times r\) block of \({\bf A}\) is fixed at \({\bf I}_r\) and only \({\bf A}^*\in\mathcal{R}^{(p-r)\times r}\) is free. The author assumes that \(p({\bf A},{\bf B})\propto\exp\left(-\frac{\tau^2}{2} trace\{ {\bf A}^{{\top}}{\bf A}+{\bf B}^{{\top}}{\bf B}\}\right)\) and \(\sigma^2\sim \mathcal{IG}(\frac{a}{2},\frac{b}{2})\).
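A small numerical sketch (with arbitrary illustrative dimensions) of this non-uniqueness and of the identified form; it assumes the top \(r\times r\) block of \({\bf A}\) is nonsingular:

p <- 5; q <- 4; r <- 2
A <- matrix(rnorm(p * r), p, r)
B <- matrix(rnorm(q * r), q, r)
C <- A %*% t(B)
# Any nonsingular Q gives another factorization of the same C
Q <- matrix(rnorm(r * r), r, r)
A.tilde <- A %*% Q
B.tilde <- B %*% t(solve(Q))
max(abs(C - A.tilde %*% t(B.tilde)))  # ~ 0: same coefficient matrix
# Identified form: rescale so that the top r x r block of A becomes I_r
A1 <- A[1:r, ]                        # assumed nonsingular
A.id <- A %*% solve(A1)               # top block is now I_r
B.id <- B %*% t(A1)
max(abs(C - A.id %*% t(B.id)))        # ~ 0 again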
Posterior Distribution
Let \(\tilde{{\bf a}}_k\) and \(\tilde{{\bf b}}_k\) denote the \(k\)th column of \({\bf A}\) and \({\bf B}\), respectively. Let \({{\bf a}}_j^{{\top}}\) and \({{\bf b}}_l^{{\top}}\) denote the \(j\)th row of \({\bf A}\) and the \(l\)th row of \({\bf B}\), respectively.
Write \({\bf X}{\bf A}{\bf B}^{{\top}}={\bf X}_{(\tilde{j})}{\bf A}_{(j)}{\bf B}^{{\top}}+\tilde{{\bf x}}_j{\bf a}_j^{{\top}}{\bf B}^{{\top}}\), where \({\bf X}_{(\tilde{j})}\) denotes \({\bf X}\) with its \(j\)th column removed, \({\bf A}_{(j)}\) denotes \({\bf A}\) with its \(j\)th row removed, and \(\tilde{{\bf x}}_j\) is the \(j\)th column of \({\bf X}\). Then \[\begin{eqnarray*} &&p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e)\\ &\propto& |{\boldsymbol \Sigma}_e|^{-\frac{n}{2}}\exp\left(-\frac{1}{2}trace\{ ({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}}){\boldsymbol \Sigma}_e^{-1}({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}\}\right)\\ &=& (\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{ ({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}\}\right)\\ &=&(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)}{\bf B}^{{\top}}) ({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)}{\bf B}^{{\top}})^{{\top}}\}\right)\\ &\times&\exp\left(-\frac{1}{2\sigma^2}trace\{(\tilde{{\bf x}}_j{\bf a}_j^{{\top}}{\bf B}^{{\top}}{\bf B}{\bf a}_j\tilde{{\bf x}}_j^{{\top}}) \}\right)\\ &\times&\exp\left(\frac{1}{\sigma^2}trace\{(\tilde{{\bf x}}_j{\bf a}_j^{{\top}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}})\}\right)\\ &=&(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)}{\bf B}^{{\top}}) ({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)}{\bf B}^{{\top}})^{{\top}}\}\right)\\ &\times&\exp\left(-\frac{1}{2\sigma^2}{\bf a}_j^{{\top}}{\bf B}^{{\top}}{\bf B}{\bf a}_j\tilde{{\bf x}}_j^{{\top}}\tilde{{\bf x}}_j\right) \times \exp\left((-2)\frac{-1}{2\sigma^2}{\bf a}_j^{{\top}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}}\tilde{{\bf x}}_j\right). \end{eqnarray*}\]
\[\begin{eqnarray*} &&p({\bf a}_j|{\bf A}_{(j)},{\bf B},{\boldsymbol \Sigma}_e,{\bf Y})\\ &\propto& p({\bf A}|{\bf B},{\boldsymbol \Sigma}_e,{\bf Y})\propto p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e) p({\bf A},{\bf B})\\ &\propto& p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e) \exp \left(-\frac{\tau^2}{2} trace\{ {\bf A}^{{\top}}{\bf A}+{\bf B}^{{\top}}{\bf B}\}\right) \\ &\propto& p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e) \exp \left(-\frac{\tau^2}{2} \sum_{j=1}^{p}{\bf a}_j^{{\top}}{\bf a}_j\right)\\ &\propto& \exp\left(-\frac{1}{2\sigma^2}{\bf a}_j^{{\top}}{\bf B}^{{\top}}{\bf B}{\bf a}_j\tilde{{\bf x}}_j^{{\top}}\tilde{{\bf x}}_j\right) \times \exp\left((-2)\frac{-1}{2\sigma^2}{\bf a}_j^{{\top}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}}\tilde{{\bf x}}_j\right)\\ &\times& \exp \left(-\frac{\tau^2}{2}{\bf a}_j^{{\top}}{\bf a}_j\right)\\ &=& \exp\left\{-\frac{1}{2}\left({\bf a}_j^{{\top}}(\sigma^{-2}{\bf B}^{{\top}}{\bf B}\tilde{{\bf x}}_j^{{\top}}\tilde{{\bf x}}_j +{\bf I}_r\tau^2){\bf a}_j-2{\bf a}_j^{{\top}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}}\tilde{{\bf x}}_j\sigma^{-2}\right)\right\}\\ &=& \exp\left\{-\frac{1}{2}\left({\bf a}_j^{{\top}}{{\boldsymbol \Sigma}_j^{A}}^{-1} {\bf a}_j-2{\bf a}_j^{{\top}}{{\boldsymbol \Sigma}_j^{A}}^{-1}{{\boldsymbol \Sigma}_j^{A}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}}\tilde{{\bf x}}_j\sigma^{-2}\right)\right\}\\ &=& \exp\left\{-\frac{1}{2}\left({\bf a}_j^{{\top}}{{\boldsymbol \Sigma}_j^{A}}^{-1} {\bf a}_j-2{\bf a}_j^{{\top}}{{\boldsymbol \Sigma}_j^{A}}^{-1}{\boldsymbol \mu}_j^{A}\right)\right\}\\ &\propto& \exp\left\{-\frac{1}{2}\left(({\bf a}_j-{\boldsymbol \mu}_j^{A})^{{\top}}{{\boldsymbol \Sigma}_j^{A}}^{-1}({\bf a}_j-{\boldsymbol \mu}_j^{A}) \right)\right\}, \end{eqnarray*}\] where \({\boldsymbol \Sigma}_j^{A}={({\bf B}^{{\top}}{\bf B}\tilde{{\bf x}}_j^{{\top}}\tilde{{\bf x}}_j\sigma^{-2} +{\bf I}_r\tau^2)}^{-1}\) and \({\boldsymbol \mu}_j^{A}={{\boldsymbol \Sigma}_j^{A}}{\bf B}^{{\top}}({\bf Y}-{\bf X}_{(\tilde{j})}{\bf A}_{(j)} {\bf B}^{{\top}})^{{\top}}\tilde{{\bf x}}_j\sigma^{-2}\).
Hence, \({\bf a}_j|{\bf A}_{(j)},{\bf B},{\boldsymbol \Sigma}_e,{\bf Y}\sim \mathcal{N}_r({\boldsymbol \mu}_j^A,{\boldsymbol \Sigma}_j^A)\). Next, consider the rows of \({\bf B}\). Let \({\bf Y}_{(\tilde{l})}\) and \(({\bf B}^{{\top}})_{(\tilde{l})}\) denote \({\bf Y}\) and \({\bf B}^{{\top}}\) with the \(l\)th column removed, and let \(\tilde{{\bf y}}_l\) denote the \(l\)th column of \({\bf Y}\). Then \[\begin{eqnarray*} &&p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e)\\ &\propto& |{\boldsymbol \Sigma}_e|^{-\frac{n}{2}}\exp\left(-\frac{1}{2}trace\{ {\boldsymbol \Sigma}_e^{-1}({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})\}\right)\\ &=& (\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{ ({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})\}\right)\\ &=&(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{({\bf Y}_{(\tilde{l})} -{\bf X}{\bf A}({\bf B}^{{\top}})_{(\tilde{l})})^{{\top}}({\bf Y}_{(\tilde{l})}-{\bf X}{\bf A}({\bf B}^{{\top}})_{(\tilde{l})})\}\right)\\ &\times& \exp\left(-\frac{1}{2\sigma^2}\left[{\bf b}_{l}^{{\top}} ({\bf X}{\bf A})^{{\top}}{\bf X}{\bf A}{\bf b}_{l}- 2{\bf b}_{l}^{{\top}}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l+\tilde{{\bf y}}_l^{{\top}}\tilde{{\bf y}}_l\right]\right). \end{eqnarray*}\]
\[\begin{eqnarray*} &&p({\bf b}_l|{\bf A},({\bf B}^{{\top}})_{(\tilde{l})},{\bf Y},{\boldsymbol \Sigma}_e)\\ &\propto& p({\bf B}|{\bf A},{\bf Y},{\boldsymbol \Sigma}_e) \propto p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e)p({\bf A},{\bf B})\\ &\propto & p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e) \exp \left(-\frac{\tau^2}{2} trace\{ {\bf A}^{{\top}}{\bf A}+{\bf B}^{{\top}}{\bf B}\}\right) \\ &\propto& p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e) \exp \left(-\frac{\tau^2}{2} \sum_{l=1}^{q}{\bf b}_l^{{\top}}{\bf b}_l\right)\\ &\propto& \exp\left(-\frac{1}{2\sigma^2}\left[{\bf b}_{l}^{{\top}} ({\bf X}{\bf A})^{{\top}}{\bf X}{\bf A}{\bf b}_{l}- 2{\bf b}_{l}^{{\top}}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l+\tilde{{\bf y}}_l^{{\top}}\tilde{{\bf y}}_l\right]\right) \exp \left(-\frac{\tau^2}{2}{\bf b}_l^{{\top}}{\bf b}_l\right)\\ &=& \exp\left(-\frac{1}{2}\left[{\bf b}_{l}^{{\top}} ({\bf X}{\bf A})^{{\top}}{\bf X}{\bf A}{\bf b}_{l}\sigma^{-2}- 2{\bf b}_{l}^{{\top}}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l\sigma^{-2}+ \tilde{{\bf y}}_l^{{\top}}\tilde{{\bf y}}_l\sigma^{-2}\right]\right) \exp \left(-\frac{\tau^2}{2}{\bf b}_l^{{\top}}{\bf b}_l\right)\\ &=& \exp\left\{-\frac{1}{2}\left({\bf b}_{l}^{{\top}}( ({\bf X}{\bf A})^{{\top}}{\bf X}{\bf A}\sigma^{-2}+{\bf I}_r\tau^2){\bf b}_{l}- 2{\bf b}_{l}^{{\top}}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l\sigma^{-2}\right)\right\}\\ &=& \exp\left\{-\frac{1}{2}\left({\bf b}_{l}^{{\top}}{{\boldsymbol \Sigma}_l^B}^{-1}{\bf b}_{l}- 2{\bf b}_{l}^{{\top}}{{\boldsymbol \Sigma}_l^B}^{-1}{{\boldsymbol \Sigma}_l^B}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l\sigma^{-2}\right)\right\}\\ &=& \exp\left\{-\frac {1}{2}\left({\bf b}_{l}^{{\top}}{{\boldsymbol \Sigma}_l^B}^{-1}{\bf b}_{l}- 2{\bf b}_{l}^{{\top}}{{\boldsymbol \Sigma}_l^B}^{-1}{\boldsymbol \mu}_l^B\right)\right\}\\ &\propto& \exp\left\{-\frac{1}{2}\left(({\bf b}_{l}-{\boldsymbol \mu}_l^B)^{{\top}}{{\boldsymbol \Sigma}_l^B}^{-1}({\bf b}_{l}-{\boldsymbol \mu}_l^B) \right)\right\}, \end{eqnarray*}\] where \({\boldsymbol \Sigma}_l^B= (({\bf X}{\bf A})^{{\top}}{\bf X}{\bf A}\sigma^{-2}+{\bf I}_r\tau^2)^{-1}\) and \({\boldsymbol \mu}_l^B={{\boldsymbol \Sigma}_l^B}({\bf X}{\bf A})^{{\top}}\tilde{{\bf y}}_l\sigma^{-2}\).
Hence, \({\bf b}_l|{\bf A},({\bf B}^{{\top}})_{(\tilde{l})},{\bf Y},{\boldsymbol \Sigma}_e\sim \mathcal{N}_r({\boldsymbol \mu}_l^B,{\boldsymbol \Sigma}_l^B)\).
We know that the element in \(k\)th row and \(k\)th column of \({\boldsymbol \Sigma}_e\) is \(\sigma^2, k=1,2,\ldots,q\). \[\begin{eqnarray*} &&p(\sigma^2|{\bf A},{\bf B},{\bf Y},({\boldsymbol \Sigma}_e)_{(k)(\tilde{k})})\\ &\propto& p({\boldsymbol \Sigma}_e|{\bf A},{\bf B},{\bf Y})\propto p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e)p({\boldsymbol \Sigma}_e)\propto p({\bf Y}|{\bf A},{\bf B},{\boldsymbol \Sigma}_e)p(\sigma^2)\\ &=& (\sigma^2)^{-\frac{nq}{2}}\exp\left(-\frac{1}{2\sigma^2}trace\{ ({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}\}\right)\\ &\times & (\sigma^2)^{-\frac{a}{2}-1}\exp\left(-\frac{b}{2\sigma^2}\right)\\ &=& (\sigma^2)^{-\frac{nq+a}{2}-1}\exp\left(-\frac{1}{2\sigma^2}(trace\{ ({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}\}+b)\right). \end{eqnarray*}\] Hence, \(\sigma^2|{\bf A},{\bf B},{\bf Y},({\boldsymbol \Sigma}_e)_{(k)(\tilde{k})}\sim \mathcal{IG}\left(\frac{nq+a}{2},\frac{1}{2}(trace\{({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}\}+b)\right)\).
Gibbs Sampling
The algorithm is easy to construct. Begin with initial values \({\bf A}^{(0)}, {\bf B}^{(0)}, {\sigma^2}^{(0)}\), then for t = 1,2,…
- Step (1) Given \({\bf A}_{(j)}^{(t-1)},{\bf B}^{{\top}(t-1)}, {\sigma^2}^{(t-1)}\), draw \({\bf a}_j^{(t)}\) from \(\mathcal{N}_r({\boldsymbol \mu}_j^A,{\boldsymbol \Sigma}_j^A),j=r+1,r+2,\ldots,p\);
- Step (2) Given \({\bf A}^{(t)},{\bf B}^{{\top}(t-1)}_{(\tilde{l})}, {\sigma^2}^{(t-1)}\), draw \({\bf b}_l^{(t)}\) from \(\mathcal{N}_r({\boldsymbol \mu}_l^B,{\boldsymbol \Sigma}_l^B),l=1,2,\ldots,q\);
- Step (3) Given \({\bf B}^{(t)}, {\bf A}^{(t)}\), draw \({\sigma^2}^{(t)}\) from \(\mathcal{IG}\left(a^*,b^*\right)\), where \(a^*=\frac{nq+a}{2}\) and \(b^*=\frac{1}{2}\left(trace\{({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})({\bf Y}-{\bf X}{\bf A}{\bf B}^{{\top}})^{{\top}}\}+b\right)\), evaluated at the current draws, as derived above.
- Step (4) Repeat step (1) to step (3) until convergence.
Note that step (1) samples \({\bf a}_j\) starting from \(j=r+1\), because the first \(r\) rows of \({\bf A}\) form \({\bf I}_r\), which is known. Hence, step (1) samples \({\bf A}^*\) and then stacks \({\bf I}_r\) on top of it to construct \({\bf A}\). A condensed sketch of one Gibbs sweep is given below; the full simulation code appears at the end of this post.
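The sketch below assumes \(r<p\) and that the data X, Y, the current values hat.A (\(p\times r\)), hat.B (\(q\times r\)), hat.sig2, and the hyperparameters tau2, a, b (together with n, p, q, r) are already defined as in the full code at the end of this post.

library(mvtnorm)
library(invgamma)
# Step (1): update rows r+1,...,p of A (the first r rows stay fixed at I_r)
for (j in (r + 1):p) {
  Sig.A_j <- solve(crossprod(hat.B) * sum(X[, j]^2) / hat.sig2 + diag(tau2, r))
  mu.A_j  <- Sig.A_j %*% t(hat.B) %*% t(Y - X[, -j] %*% hat.A[-j, ] %*% t(hat.B)) %*% X[, j] / hat.sig2
  hat.A[j, ] <- rmvnorm(1, mean = mu.A_j, sigma = Sig.A_j)
}
# Step (2): update the rows of B (the conditional covariance does not depend on l)
Sig.B <- solve(crossprod(X %*% hat.A) / hat.sig2 + diag(tau2, r))
for (l in 1:q) {
  mu.B_l <- Sig.B %*% t(X %*% hat.A) %*% Y[, l] / hat.sig2
  hat.B[l, ] <- rmvnorm(1, mean = mu.B_l, sigma = Sig.B)
}
# Step (3): update sigma^2 from its inverse-gamma full conditional
RSS <- sum((Y - X %*% hat.A %*% t(hat.B))^2)
hat.sig2 <- rinvgamma(1, shape = (n * q + a) / 2, rate = (RSS + b) / 2)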
Model Selection
To this point we have proceeded as if \(r\) were known. In most applications this is not true, so the analysis up to this point is conditional on \(r\). When \(r\) is unknown, the analysis may be carried out for several alternative values of \(r\). Under the Bayesian framework, model selection can be implemented through the posterior probability of each model given the data, \[ p(M_r|y)=\frac{p(y|M_r)p(M_r)}{\sum_{r\in \mathcal{M}}p(y|M_r)p(M_r)}= \frac{p(y|M_r)}{\sum_{r\in \mathcal{M}}p(y|M_r)}, \] where \(\mathcal{M}=\{1,2,\ldots,\min(p,q)\}\) and \(p(M_r)=\frac{1}{card(\mathcal{M})}\).
Hence, since the prior over ranks (models) is flat, \(p(M_r|y)\) is determined only by the marginal likelihood \(p(y|M_r)\). The model selection problem is thus converted into finding the rank that maximizes the marginal likelihood \(p(y|M_r)\). To calculate \(p(y|M_r)\), I use the Laplace and Gelfand and Dey (GD) methods, and then compare them with DIC.
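Given the log marginal likelihoods \(\log p(y|M_r)\) of the candidate ranks, the posterior model probabilities follow directly. A minimal R sketch (logml is a hypothetical vector of log marginal likelihoods, one entry per rank; the maximum is subtracted for numerical stability):

post.model.prob <- function(logml) {
  c.star <- max(logml)          # stabilizing constant (log-sum-exp)
  w <- exp(logml - c.star)      # unnormalized posterior weights under a flat prior on r
  w / sum(w)                    # p(M_r | y)
}
# e.g. with the Laplace values reported in Table 1 below:
# round(post.model.prob(c(-3007, -2535, -2223, -2257, -2285, -2310, -2329)), 3)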
Laplace
Let \({\boldsymbol \theta}= ({\bf A},{\bf B},\sigma^2)\). \[\begin{eqnarray*} p({\bf Y}|M_r) &=& \int \int \int p({\bf Y}|{\boldsymbol \theta},M)p({\boldsymbol \theta}) d{{\bf A}}d{{\bf B}}d{\sigma^2}\\ &\approx& p({\bf Y}|\hat{{\boldsymbol \theta}},M)p(\hat{\boldsymbol \theta}|M)|(nq)^{-1}\hat{\boldsymbol \Sigma}_M|^{1/2} (2\pi)^{k_M/2}, \end{eqnarray*}\] where \(k_M=r(p+q-r)+1\) is the number of free parameters in model \(M_r\) and \((\hat{\bf A},\hat{\bf B},\hat\sigma^2)=\arg \max p({\bf Y}|{{\bf A}},{{\bf B}},{\sigma^2},M)p({\bf A},{\bf B},\sigma^2|M)\).
Hence, \[\begin{eqnarray} &&\log p({\bf Y}|M_r) \nonumber\\ &\approx& \log p({\bf Y}|\hat{{\boldsymbol \theta}},M) + \log p(\hat{\boldsymbol \theta}|M) -\frac{1}{2}k_M\log nq +\frac{1}{2}\log|\hat{\boldsymbol \Sigma}_M| + \frac{k_M}{2}\log 2\pi\nonumber\\ &=&-\frac{1}{2} \left(-2\log(p({\bf Y}|\hat{{\bf A}},\hat{{\bf B}},\hat{\sigma^2},M)) + k_M \log nq\right) +C\nonumber\\ &=&-\frac{1}{2}(nq\log(2\pi\hat{\sigma^2})+ \frac{1}{\hat{\sigma^2}} trace\left\{({\bf Y}-{\bf X}\hat{\bf C})^{{\top}}({\bf Y}-{\bf X}\hat{\bf C}) \right\} \nonumber\\ &&+ [r(p+q-r)+1] \log(nq))+C\nonumber\\ &=& -\frac{1}{2}BIC + C\nonumber\\ &=& -\frac{1}{2}BIC, \text{as } n\rightarrow \infty. \tag{2} \end{eqnarray}\]
The difficulty with using the Laplace approximation here is that we need to find the mode of \(p({\bf A},{\bf B},\sigma^2|{\bf Y})\), and \(({\bf A},{\bf B},\sigma^2)\) is high-dimensional. To address this problem, Besag proposed the iterated conditional modes (ICM) algorithm. ICM finds a local maximum of the joint posterior by iteratively maximizing the full conditionals as follows:
- Begin with initial values \({\bf A}^{(0)},{\bf B}^{(0)},{\sigma^2}^{(0)}\), then for t = 1,2,…
- Given \({\bf A}_{(j)}^{(t-1)},{\bf B}^{{\top}(t-1)}, {\sigma^2}^{(t-1)}\), \({\bf a}_j^{(t)}\leftarrow{\boldsymbol \mu}_j^A,j=r+1,r+2,\ldots,p\);
- Given \({\bf A}^{(t)},{\bf B}^{{\top}(t-1)}_{(\tilde{l})}, {\sigma^2}^{(t-1)}\), \({\bf b}_l^{(t)}\leftarrow{\boldsymbol \mu}_l^B,l=1,2,\ldots,q\);
- Given \({\bf B}^{(t)}, {\bf A}^{(t)}\), \({\sigma^2}^{(t)}\leftarrow\frac{b^*}{a^*-1}\).
We obtain \({\bf C}^{(t)}={\bf A}^{(t)}{\bf B}^{{\top}(t)}\) and estimate \(\hat{{\bf C}}_{ICM}=T^{-1}\sum_{t=1}^{T}{\bf C}^{(t)}\). Plugging \(\hat{{\bf C}}_{ICM}\) into Eq.(2) gives \(-\frac{1}{2}BIC\). Hence, the Laplace approximation here essentially reduces to BIC. A condensed R sketch of these updates is given below.
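The sketch below covers one ICM sweep and the evaluation of Eq.(2); it reuses the same objects as the Gibbs sweep above (X, Y, hat.A, hat.B, hat.sig2, tau2, a, b, n, p, q, r with r < p) and simply replaces each random draw by the mean of the corresponding full conditional, as in the updates listed above.

# One ICM sweep: set each block to the mean of its full conditional
for (j in (r + 1):p) {
  Sig.A_j <- solve(crossprod(hat.B) * sum(X[, j]^2) / hat.sig2 + diag(tau2, r))
  hat.A[j, ] <- Sig.A_j %*% t(hat.B) %*% t(Y - X[, -j] %*% hat.A[-j, ] %*% t(hat.B)) %*% X[, j] / hat.sig2
}
Sig.B <- solve(crossprod(X %*% hat.A) / hat.sig2 + diag(tau2, r))
for (l in 1:q) {
  hat.B[l, ] <- Sig.B %*% t(X %*% hat.A) %*% Y[, l] / hat.sig2
}
a.star <- (n * q + a) / 2
b.star <- (sum((Y - X %*% hat.A %*% t(hat.B))^2) + b) / 2
hat.sig2 <- b.star / (a.star - 1)

# After the sweeps, plug the estimate of C into Eq.(2)
hat.C <- hat.A %*% t(hat.B)
k.M   <- r * (p + q - r) + 1
half.neg.BIC <- -0.5 * (n * q * log(2 * pi * hat.sig2) +
                        sum((Y - X %*% hat.C)^2) / hat.sig2 +
                        k.M * log(n * q))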
DIC
Let \({\boldsymbol \theta}=({\bf A},{\bf B},\sigma^2)\), \[\begin{eqnarray*} &&D({\boldsymbol \theta})\\ &=&-2\log p({\bf Y}|{\boldsymbol \theta},M) \\ &=&nq\log(2\pi\sigma^2)+ \frac{1}{\sigma^2} trace\left\{({\bf Y}-{\bf X}{\bf C})^{{\top}}({\bf Y}-{\bf X}{\bf C}) \right\} \end{eqnarray*}\]
Then DIC can be computed by \[ DIC \approx 2T^{-1}\sum_{t=1}^{T}D({\boldsymbol \theta}^{(t)})-D(T^{-1}\sum_{t=1}^{T}{\boldsymbol \theta}^{(t)}), \] where \({\boldsymbol \theta}^{(t)}\) is an MCMC sample generated from \(p({\boldsymbol \theta}|{\bf Y})\).
Note that \(DIC=D(\bar{{\boldsymbol \theta}})+2P_D\), where \(P_D=T^{-1}\sum_{t=1}^{T}D({\boldsymbol \theta}^{(t)})-D(\bar{{\boldsymbol \theta}})\) is the effective number of parameters, so DIC is analogous to AIC. A small sketch of this computation from the MCMC output is given below.
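A minimal sketch of the DIC computation from stored post-burn-in draws (hypothetical objects: Hat.C, a p x q x T array of draws of \({\bf C}\); Hat.sig2, a length-T vector of draws of \(\sigma^2\); and the data X, Y):

deviance.rrr <- function(C, sig2, X, Y) {
  # D(theta) = nq*log(2*pi*sigma^2) + RSS/sigma^2
  n <- nrow(Y); q <- ncol(Y)
  n * q * log(2 * pi * sig2) + sum((Y - X %*% C)^2) / sig2
}
T.keep <- dim(Hat.C)[3]
D.t   <- sapply(1:T.keep, function(t) deviance.rrr(Hat.C[, , t], Hat.sig2[t], X, Y))
D.bar <- mean(D.t)                                           # posterior mean deviance
D.hat <- deviance.rrr(apply(Hat.C, c(1, 2), mean), mean(Hat.sig2), X, Y)
P.D   <- D.bar - D.hat                                       # effective number of parameters
DIC   <- D.hat + 2 * P.D                                     # = 2 * D.bar - D.hat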
Gelfand & Dey (GD)
Let \({\boldsymbol \theta}=({\bf A},{\bf B},\sigma^2)\); then the GD estimator is \[p({\bf Y}|M)\approx\left[T^{-1}\sum_{t=1}^{T}\frac{g({\boldsymbol \theta}^{(t)})}{p({\bf Y}|{\boldsymbol \theta}^{(t)})p({\boldsymbol \theta}^{(t)})}\right]^{-1},\] where \({\boldsymbol \theta}^{(t)}\) is an MCMC sample generated from \(p({\boldsymbol \theta}|{\bf Y})\). Define \(g({\boldsymbol \theta})=N({\boldsymbol \theta}|\tilde{{\boldsymbol \theta}},\tilde{{\boldsymbol \Sigma}})\), where \(\tilde{{\boldsymbol \theta}}\) and \(\tilde{{\boldsymbol \Sigma}}\) are the MCMC sample mean and covariance, respectively. I use the following formula: \[\begin{eqnarray} \log p({\bf Y}|M)&\approx& \log \left[\frac{1}{T}\sum_{t=1}^{T}\frac{g^{(t)}}{f^{(t)}}\right]^{-1}\nonumber\\ &=&\log \left[\frac{T}{\sum_{t=1}^{T}\frac{g^{(t)}}{f^{(t)}}}\right]\nonumber\\ &=&\log T - \log \left(\sum_{t=1}^{T}\frac{g^{(t)}}{f^{(t)}}\right)\nonumber\\ &=&\log T - \log \left(\sum_{t=1}^{T}\exp (\log g^{(t)}-\log f^{(t)})\right)\tag{3}\\ &=&\log T - \log \left(\sum_{t=1}^{T}\exp c_t\right)\nonumber\\ &=&\log T - \log \left(e^{c^*}\sum_{t=1}^{T} e^{c_t-c^*}\right)\nonumber\\ &=&\log T - \left[c^* + \log \left(\sum_{t=1}^{T} \exp(c_t-c^*)\right)\right]\nonumber, \end{eqnarray}\] where \(g^{(t)}=g({\boldsymbol \theta}^{(t)})\), \(f^{(t)}=p({\bf Y}|{\boldsymbol \theta}^{(t)})p({\boldsymbol \theta}^{(t)})\), \(c_t = \log g^{(t)}-\log f^{(t)}\), and \(c^*=\max_{t} c_t\).
Note that when calculating \(\log p({\bf Y}|M)\), there is a numerical problem in formula (3): \(\exp (\log g^{(t)}-\log f^{(t)})\) can overflow, because \(\log g^{(t)}-\log f^{(t)}\) has a very large magnitude. To solve this problem, I use the last expression, \(\log T - \left[c^* + \log \left(\sum_{t=1}^{T} \exp(c_t-c^*)\right)\right]\) with \(c^*=\max_t c_t\) (the log-sum-exp trick), to calculate \(\log p({\bf Y}|M)\).
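A small R sketch of this stabilized computation (hypothetical inputs: log.g and log.f, length-T vectors holding \(\log g({\boldsymbol \theta}^{(t)})\) and \(\log[p({\bf Y}|{\boldsymbol \theta}^{(t)})p({\boldsymbol \theta}^{(t)})]\) evaluated at the stored draws):

log.marginal.GD <- function(log.g, log.f) {
  c.t    <- log.g - log.f                 # c_t = log g^(t) - log f^(t)
  c.star <- max(c.t)                      # log-sum-exp stabilizing constant
  log(length(c.t)) - (c.star + log(sum(exp(c.t - c.star))))
}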
Simulation Study
Data Generation
In the simulation study, my goal is to recover the true model, i.e., the true rank of the coefficient matrix \({\bf C}\), using the Laplace, DIC, and GD methods. I set \(n=100\), \(q=12\), \(p=7\), and \(\sigma^2=2\). The coefficient matrix is as follows:
\[{\bf C}=\left[ \begin{array}{@{}*{12}{c}@{}} 1 & 0 & 0 & 2 & -1 & 0 & 0 & 0 & 0 & 0 & 1 & -1\\ 0 & 1 & 0 & 0 & 0 & -3 & 2 & 0 & 0 & 0 & -1 & 3\\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 3 & -3 & 4 & 2 & 2\\ 1 & 0 & 0 & 2 & -1 & 0 & 0 & 0 & 0 & 0 & 1 & -1\\ 0 & 1 & 0 & 0 & 0 & -3 & 2 & 0 & 0 & 0 & -1 & 3\\ 0 & 1 & 0 & 0 & 0 & -3 & 2 & 0 & 0 & 0 & -1 & 3\\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 3 &-3 &4 & 2 & 2 \end{array} \right]\] Hence, the true rank of \({\bf C}\) is 3.
Data are generated from the model in Eq.(1). For the prior, \(p({\bf A},{\bf B})\propto\exp\left(-\frac{\tau^2}{2} trace\{ {\bf A}^{{\top}}{\bf A}+{\bf B}^{{\top}}{\bf B}\}\right)\) and \(\sigma^2\sim \mathcal{IG}(\frac{a}{2},\frac{b}{2})\), where \(\tau^2=10^{-3}\) and \(a=b=1\).
After running the MCMC simulation, the result based on one replication is shown in Table 1.
Method | \(r=1\) | \(r=2\) | \(r=3\) | \(r=4\) | \(r=5\) | \(r=6\) | \(r=7\) |
---|---|---|---|---|---|---|---|
Laplace (\(\log\)) | -3007 | -2535 | -2223 | -2257 | -2285 | -2310 | -2329 |
DIC | 84300 | 17779 | 7302 | 7317 | 7325 | 7363 | 7405 |
GD (\(\log\)) | -3062 | -2631 | -2347 | -2397 | -2436 | -2470 | -2495 |
MSE | 1.032 | 0.186 | 0.0120 | 0.014 | 0.0170 | 0.0192 | 0.0217 |
The MSE in the table is defined as \[ MSE = \frac{trace\{(\hat {\bf C}-{\bf C})^{{\top}}(\hat {\bf C}-{\bf C})\}}{pq}, \] which measures the goodness of fit of the reduced rank regression model. We can observe that the MSE is minimized when the rank is 3, indicating that the parameter estimation is accurate at the true rank. In this case, the Laplace, DIC, and GD methods all select the true rank. Given rank\(({\bf C})=3\), the estimated \({\bf C}\) is as follows:
\[\hat{\bf C}=\left[ \begin{array}{@{}*{12}{c}@{}} 1.07 & 0.17 & 0.01 & 2.34 & -0.89 & -0.12 & -0.11 & -0.08 & -0.07 & 0.03 & 1.26 & -1.10\\ -0.02 & 1.00 & 0.04 & -0.14 & -0.17 & -2.83 & 2.10 & 0.10 & 0.03 & 0.03 & -1.04 & 3.17\\ 0.13 & -0.03 & 0.99 & 0.18 & -0.02 & 0.08 & 0.00 & 3.06 & -2.85 & 4.03 & 2.10 & 2.06\\ 0.94 & 0.14 & 0.03 & 2.05 & -0.78 & -0.08 & -0.11 & -0.03 & -0.10 & 0.07 & 1.13 & -0.95\\ 0.04 & 0.99 & 0.04 & -0.00 & -0.21 & -2.79 & 2.05 & 0.10 & 0.01 & 0.04 & -0.95 & 3.05\\ 0.12 & 1.05 & 0.00 & 0.18 & -0.29 & -2.93 & 2.14 & -0.02 & 0.12 & -0.12 & -0.98 & 3.03\\ 0.08 & -0.01 & 0.98 & 0.08 & 0.01 & 0.03 & 0.04 & 3.03 & -2.81 & 3.98 & 2.01 & 2.14 \end{array} \right]\]
In addition, when the selected rank is larger than 3, the MSE does not deteriorate much. In other words, overfitted models (with rank larger than the true rank) still estimate \({\bf C}\) well, and their DIC, Laplace, and GD values are not far from the values at the true rank, which is reasonable for overfitted models. However, when the selected rank is smaller than the true rank, the MSE increases sharply, and the corresponding values of DIC, Laplace, and GD move far away from those at the true rank, which means the model misses important structure. This analysis is based on one replication.
Table 2 reports results based on 1,000 replications, in order to assess the performance of the methods for model selection. Laplace and GD always select the true rank, whereas DIC tends to select a larger model: the average selected rank for DIC is about 3.5, while that of Laplace and GD is 3. Accordingly, the probability of selecting the true rank is 100% for Laplace and GD but only 61.7% for DIC. This makes sense because Laplace and GD are, respectively, an approximation to and a simulation-based estimate of the marginal likelihood, so they target model fit directly, whereas DIC is analogous to AIC, which tends to select larger models and is geared toward prediction.
Method | Mean of Selected Rank | Selection Probability |
---|---|---|
Laplace | 3 | 1 |
DIC | 3.525 | 0.617 |
GD | 3 | 1 |
R Code for Bayesian reduced rank regression with DIC used in the simulation study
library(mvtnorm)
library(invgamma)
rm(list=ls())
## Data & Model##
row1 <- c(1,0,0,2,-1,0,0,0,0,0,1,-1)
row2 <- c(0,1,0,0,0,-3,2,0,0,0,-1,3)
row3 <- c(0,0,1,0,0,0,0,3,-3,4,2,2)
C <- rbind(row1,row2,row3,row1,row2,row2,row3)
true.rank <- 3
p <- dim(C)[1]
q <- dim(C)[2]
true.sig2 <- 2
true.SIGe <- diag(true.sig2,q)
n=100
a=b=1
tau2=1e-3
MC.size=5000
burn_in=3000
reptn <- 125
error <- array(NA,dim = c(reptn,p))
DIC <- array(NA,dim = c(reptn,p))
TPR.DIC <- rep(NA,reptn)
TPR.error <- rep(NA,reptn)
hat.rank <- rep(NA,reptn)
v=0
for (i in 1:reptn) {
t1=Sys.time()
set.seed(2144+i+125*v)
X <- matrix(rnorm(n*p,0,1),n,p)
E <- rmvnorm(n, rep(0,q), true.SIGe, method="chol")
Y=X%*%C+E
## Reduced Rank Regression model ##
for (r in 1:p) {
Hat.A <- array(NA,dim = c(p,r,MC.size))
Hat.B <- array(NA,dim = c(q,r,MC.size))
Hat.C <- array(NA,dim = c(p,q,MC.size))
Hat.sig2 <- rep(NA,MC.size)
# initial values
hat.C <- coef(lm(Y~X-1))
if (r==1){
hat.B <- as.matrix(hat.C[1:r,])
} else {
hat.B <- t(hat.C[1:r,])
}
if (r==p) {
hat.A <- diag(1,r)
} else {
hat.A <- rbind(diag(1,r),hat.C[(r+1):p,]%*%hat.B%*%solve(crossprod(hat.B)))
}
hat.sig2 <- mean(diag(tcrossprod(Y-X%*%hat.A%*%t(hat.B)))) #true.sig2
## MCMC ##
for (jin in 1:MC.size) {
# Sampling A
if (r==p){
Hat.A[,,jin] <- hat.A
} else {
for(j in (r+1):p) {
Sig.A_j <- solve(crossprod(hat.B)*as.numeric(crossprod(X[,j]))/hat.sig2+diag(tau2,r))
mu.A_j <- Sig.A_j%*%t(hat.B)%*%t(Y-X[,-j]%*%hat.A[-j,]%*%t(hat.B))%*%X[,j]/hat.sig2
hat.a_j <- rmvnorm(n=1,mean = mu.A_j,sigma = Sig.A_j)
hat.A[j,] <- hat.a_j
}
Hat.A[,,jin] <- hat.A
}
# Sampling B
Sig.B_l <- solve(crossprod(X%*%hat.A)/hat.sig2+diag(tau2,r))
for(l in 1:q) {
mu.B_l <- Sig.B_l%*%t(X%*%hat.A)%*%Y[,l]/hat.sig2
hat.b_l <- t(rmvnorm(n=1,mean = mu.B_l,sigma = Sig.B_l))
hat.B[l,] <- t(hat.b_l)
}
Hat.B[,,jin] <- hat.B
Hat.C[,,jin] <- hat.A%*%t(hat.B)
#Sampling Sig2
hat.sig2 <- rinvgamma(n = 1,shape = (q*n+a)/2,rate = 0.5*(b+sum(diag(tcrossprod(Y-X%*%hat.A%*%t(hat.B))))))
Hat.sig2[jin] <- hat.sig2
#print(jin)
}
hat.C <- apply(Hat.C[,,-(1:burn_in)],c(1,2),mean)
hat.sig2 <- mean(Hat.sig2[-(1:burn_in)])
# print(C)
D.theta <- rep(NA,MC.size-burn_in)
j=0
for (jin in (burn_in+1):MC.size) {
j=j+1
D.theta[j] <- n*q*log(2*pi*Hat.sig2[jin])+sum(diag(crossprod(Y-X%*%Hat.C[,,jin])))/Hat.sig2[jin] # D(theta) = nq*log(2*pi*sig2) + RSS/sig2
}
D.hat.theta <- n*q*log(2*pi*hat.sig2)+sum(diag(crossprod(Y-X%*%hat.C)))/hat.sig2 # deviance at the posterior mean
DIC[i,r] <- 2*mean(D.theta)- D.hat.theta
error[i,r] <- sum((hat.C-C)^2)/(p*q)
}
hat.rank[i] <- which.min(DIC[i,])
TPR.DIC[i] <- hat.rank[i]==true.rank
TPR.error[i] <- which.min(error[i,])==true.rank
print(c(i,hat.rank[i]))
print(Sys.time()-t1)
}
data.frame(hat.rank=mean(hat.rank),TPR.DIC=mean(TPR.DIC),TPR.ERR=mean(TPR.error))
save(list = ls(), file = paste0('DIC-',v,'.RData'))
Reference
Geweke, John F. 1996. “Bayesian Reduced Rank Regression in Econometrics.” Journal of Econometrics 75 (November). https://doi.org/10.1016/0304-4076(95)01773-9.