Bayesian Reduced Rank Regression in Econometrics
Caleb Jin / 2018-05-09
Foreword
After reading the paper Bayesian Reduced Rank Regression in Econometrics (Geweke 1996), I write down notes on its key ideas together with R code to implement them.
Introduction
In contemporary society, large amounts of data are generated and collected routinely in various academic and industrial areas, such as engineering, politics, B2C e-commerce, and genomics. Many problems can be cast as statistical problems under the framework of the multivariate linear regression model, which is characterized by both the response variables and the predictors being high dimensional.
We assume $n$ independent observations of the response $\mathbf{y}_i \in \mathbb{R}^q$ with predictor vector $\mathbf{x}_i \in \mathbb{R}^p$, $i = 1, \dots, n$. Consider the multivariate linear regression model as follows:
$$\mathbf{Y} = \mathbf{X}\mathbf{C} + \mathbf{E}, \tag{1}$$
where $\mathbf{Y} = (\mathbf{y}_1, \dots, \mathbf{y}_n)^{\rm T} \in \mathbb{R}^{n \times q}$ is the response matrix, $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_n)^{\rm T} \in \mathbb{R}^{n \times p}$ is the design matrix, $\mathbf{C} \in \mathbb{R}^{p \times q}$ is the coefficient matrix, and $\mathbf{E} = (\mathbf{e}_1, \dots, \mathbf{e}_n)^{\rm T} \in \mathbb{R}^{n \times q}$ is the disturbance matrix with the $\mathbf{e}_i$'s i.i.d. We assume $\mathbf{e}_i \sim \mathcal{N}_q(\mathbf{0}, \sigma^2\mathbf{I}_q)$. Therefore, we have $\mathbf{y}_i \mid \mathbf{x}_i \sim \mathcal{N}_q(\mathbf{C}^{\rm T}\mathbf{x}_i, \sigma^2\mathbf{I}_q)$.
Bayesian Reduced Rank Regression Model
We consider decomposing the coefficient matrix $\mathbf{C}$ into two parts as follows:
$$\mathbf{C} = \mathbf{A}\mathbf{B}^{\rm T},$$
where $\mathbf{A} \in \mathbb{R}^{p \times r}$, $\mathbf{B} \in \mathbb{R}^{q \times r}$, and the rank $r \leq \min(p, q)$ is known. However, this decomposition is not unique, because $\mathbf{A}\mathbf{B}^{\rm T} = (\mathbf{A}\mathbf{Q})(\mathbf{B}\mathbf{Q}^{-\rm T})^{\rm T}$ with any nonsingular matrix $\mathbf{Q} \in \mathbb{R}^{r \times r}$. In order to identify it, we further decompose $\mathbf{A} = \begin{pmatrix} \mathbf{I}_r \\ \mathbf{A}_* \end{pmatrix}$, so that the first $r$ rows of $\mathbf{A}$ are fixed at the identity matrix and only $\mathbf{A}_* \in \mathbb{R}^{(p-r) \times r}$ is free. The author assumes independent normal priors on the free rows of $\mathbf{A}$ and on the rows of $\mathbf{B}$, $\mathbf{a}_j \sim \mathcal{N}_r(\mathbf{0}, \tau^{-2}\mathbf{I}_r)$ for $j = r+1, \dots, p$ and $\mathbf{b}_l \sim \mathcal{N}_r(\mathbf{0}, \tau^{-2}\mathbf{I}_r)$ for $l = 1, \dots, q$, and an inverse-gamma prior $\sigma^2 \sim \mathcal{IG}(a/2, b/2)$.
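The non-uniqueness and the identification restriction can be checked numerically. The following small R snippet (my own illustration, not code from the paper) builds an identified pair $(\mathbf{A}, \mathbf{B})$, rotates it by an arbitrary nonsingular $\mathbf{Q}$, and then recovers the identified form by normalizing the top $r \times r$ block of $\mathbf{A}$ back to $\mathbf{I}_r$:

```r
# Illustration: C = A B^T is unchanged by (A, B) -> (A Q, B Q^{-T}),
# and fixing the top r x r block of A to I_r pins down a unique pair.
set.seed(1)
p <- 5; q <- 4; r <- 2
A <- rbind(diag(1, r), matrix(rnorm((p - r) * r), p - r, r))  # identified form
B <- matrix(rnorm(q * r), q, r)
C <- A %*% t(B)

Q  <- matrix(rnorm(r * r), r, r)          # any nonsingular r x r matrix
A2 <- A %*% Q
B2 <- B %*% t(solve(Q))
max(abs(C - A2 %*% t(B2)))                # ~0: same C, different (A, B)

# Recover the identified pair from an arbitrary factorization:
Q.top <- A2[1:r, ]                        # top r x r block of A2
A.id  <- A2 %*% solve(Q.top)
B.id  <- B2 %*% t(Q.top)
max(abs(A.id - A)); max(abs(B.id - B))    # ~0: identification restored
```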
Posterior Distribution
Let $\mathbf{y}_l$ and $\mathbf{x}_j$ denote the $l$th column of $\mathbf{Y}$ and the $j$th column of $\mathbf{X}$, respectively. Let $\mathbf{a}_j^{\rm T}$ and $\mathbf{b}_l^{\rm T}$ denote the $j$th row of $\mathbf{A}$ and the $l$th row of $\mathbf{B}$, respectively, and let $\mathbf{X}_{(-j)}$ and $\mathbf{A}_{(-j)}$ denote $\mathbf{X}$ without its $j$th column and $\mathbf{A}$ without its $j$th row.
For $j = r+1, \dots, p$, write $\mathbf{Y} - \mathbf{X}_{(-j)}\mathbf{A}_{(-j)}\mathbf{B}^{\rm T} = \mathbf{x}_j\mathbf{a}_j^{\rm T}\mathbf{B}^{\rm T} + \mathbf{E}$. Combining the likelihood with the prior and completing the square gives
$$\pi(\mathbf{a}_j \mid \mathbf{A}_{(-j)}, \mathbf{B}, \sigma^2, \mathbf{Y}) \propto \exp\left\{-\frac{1}{2}(\mathbf{a}_j - \boldsymbol{\mu}_{a_j})^{\rm T}\boldsymbol{\Sigma}_{a_j}^{-1}(\mathbf{a}_j - \boldsymbol{\mu}_{a_j})\right\},$$
where $\boldsymbol{\Sigma}_{a_j} = \left(\frac{\mathbf{x}_j^{\rm T}\mathbf{x}_j}{\sigma^2}\mathbf{B}^{\rm T}\mathbf{B} + \tau^2\mathbf{I}_r\right)^{-1}$ and $\boldsymbol{\mu}_{a_j} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}_{a_j}\mathbf{B}^{\rm T}\left(\mathbf{Y} - \mathbf{X}_{(-j)}\mathbf{A}_{(-j)}\mathbf{B}^{\rm T}\right)^{\rm T}\mathbf{x}_j$.
Hence, $\mathbf{a}_j \mid \mathbf{A}_{(-j)}, \mathbf{B}, \sigma^2, \mathbf{Y} \sim \mathcal{N}_r(\boldsymbol{\mu}_{a_j}, \boldsymbol{\Sigma}_{a_j})$.
Similarly, the $l$th column of $\mathbf{Y}$ satisfies $\mathbf{y}_l = \mathbf{X}\mathbf{A}\mathbf{b}_l + \mathbf{e}_{(l)}$, so
$$\pi(\mathbf{b}_l \mid \mathbf{A}, \sigma^2, \mathbf{Y}) \propto \exp\left\{-\frac{1}{2}(\mathbf{b}_l - \boldsymbol{\mu}_{b_l})^{\rm T}\boldsymbol{\Sigma}_{b}^{-1}(\mathbf{b}_l - \boldsymbol{\mu}_{b_l})\right\},$$
where $\boldsymbol{\Sigma}_{b} = \left(\frac{(\mathbf{X}\mathbf{A})^{\rm T}\mathbf{X}\mathbf{A}}{\sigma^2} + \tau^2\mathbf{I}_r\right)^{-1}$ and $\boldsymbol{\mu}_{b_l} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}_{b}(\mathbf{X}\mathbf{A})^{\rm T}\mathbf{y}_l$.
Hence, $\mathbf{b}_l \mid \mathbf{A}, \sigma^2, \mathbf{Y} \sim \mathcal{N}_r(\boldsymbol{\mu}_{b_l}, \boldsymbol{\Sigma}_{b})$ for $l = 1, \dots, q$.
We know that the element in the $i$th row and $l$th column of $\mathbf{Y} - \mathbf{X}\mathbf{A}\mathbf{B}^{\rm T}$ is $y_{il} - \mathbf{x}_i^{\rm T}\mathbf{A}\mathbf{b}_l$, which is i.i.d. $\mathcal{N}(0, \sigma^2)$ over all $nq$ entries. Hence, combining this normal likelihood with the $\mathcal{IG}(a/2, b/2)$ prior,
$$\sigma^2 \mid \mathbf{A}, \mathbf{B}, \mathbf{Y} \sim \mathcal{IG}\left(\frac{nq + a}{2},\ \frac{b + \|\mathbf{Y} - \mathbf{X}\mathbf{A}\mathbf{B}^{\rm T}\|_F^2}{2}\right).$$
Gibbs Sampling
The algorithm is easy to construct. Begin with initial values $(\mathbf{A}^{(0)}, \mathbf{B}^{(0)}, \sigma^{2(0)})$, then for $t = 1, 2, \dots$
- Step (1) Given $(\mathbf{B}^{(t-1)}, \sigma^{2(t-1)})$, draw $\mathbf{a}_j^{(t)}$, $j = r+1, \dots, p$, from $\mathcal{N}_r(\boldsymbol{\mu}_{a_j}, \boldsymbol{\Sigma}_{a_j})$;
- Step (2) Given $(\mathbf{A}^{(t)}, \sigma^{2(t-1)})$, draw $\mathbf{b}_l^{(t)}$, $l = 1, \dots, q$, from $\mathcal{N}_r(\boldsymbol{\mu}_{b_l}, \boldsymbol{\Sigma}_{b})$;
- Step (3) Given $(\mathbf{A}^{(t)}, \mathbf{B}^{(t)})$, draw $\sigma^{2(t)}$ from $\mathcal{IG}\left(\frac{nq + a}{2}, \frac{b + \|\mathbf{Y} - \mathbf{X}\mathbf{A}^{(t)}\mathbf{B}^{(t)\rm T}\|_F^2}{2}\right)$.
- Step (4) Repeat steps (1) to (3) until convergence.
Note that step (1) only samples the rows $\mathbf{a}_j$ for $j = r+1, \dots, p$, because the first part of $\mathbf{A}$ is $\mathbf{I}_r$, which is known. Hence, step (1) samples $\mathbf{A}_*$ and then stacks it under $\mathbf{I}_r$ to construct $\mathbf{A}$.
Model Selection
To this point we have proceeded as if $r$ were known. In most applications this will not be true, and so the analysis to this point is conditional on $r$. When $r$ is unknown, the analysis may be carried out for several alternative values of $r$. Under the Bayesian framework, model selection can be implemented using the posterior probability of each model (rank) given the data,
$$p(r \mid \mathbf{Y}) = \frac{m_r(\mathbf{Y})\,\pi(r)}{\sum_{r'} m_{r'}(\mathbf{Y})\,\pi(r')},$$
where $\pi(r)$ is the prior on the rank and $m_r(\mathbf{Y}) = \int f(\mathbf{Y} \mid \boldsymbol{\theta}_r)\,\pi(\boldsymbol{\theta}_r)\,d\boldsymbol{\theta}_r$ is the marginal likelihood under rank $r$, with $\boldsymbol{\theta}_r = (\mathbf{A}_*, \mathbf{B}, \sigma^2)$.
Hence, as the prior on the rank (model) is flat, $p(r \mid \mathbf{Y})$ is determined only by the marginal likelihood $m_r(\mathbf{Y})$. The model selection problem is then converted to finding the rank that maximizes the marginal likelihood $m_r(\mathbf{Y})$. In order to calculate $m_r(\mathbf{Y})$, I use the Laplace and Gelfand and Dey (GD) methods, and then compare them with DIC.
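Once (approximate) log marginal likelihoods are available for each candidate rank, the posterior rank probabilities under a flat prior follow by normalizing on the log scale. A small sketch of my own (the helper `post.rank.prob` is hypothetical, not from the paper):

```r
# Sketch: turn log marginal likelihoods into posterior rank probabilities
# under a flat prior pi(r), working on the log scale for numerical stability.
post.rank.prob <- function(logml) {
  m <- max(logml)              # log-sum-exp trick
  w <- exp(logml - m)
  w / sum(w)
}
# e.g. with the Laplace values reported in Table 1 (ranks 1 to 7):
round(post.rank.prob(c(-3007, -2535, -2223, -2257, -2285, -2310, -2329)), 3)
```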
Laplace
Let $h(\boldsymbol{\theta}) = \log\{f(\mathbf{Y} \mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta})\}$, where $\boldsymbol{\theta} = (\mathbf{A}_*, \mathbf{B}, \sigma^2)$ collects the $d_r = r(p - r) + qr + 1$ free parameters under rank $r$.
Hence, expanding $h(\boldsymbol{\theta})$ to second order around the posterior mode $\tilde{\boldsymbol{\theta}}$,
$$m_r(\mathbf{Y}) = \int e^{h(\boldsymbol{\theta})}\,d\boldsymbol{\theta} \approx (2\pi)^{d_r/2}\,\big|-\nabla^2 h(\tilde{\boldsymbol{\theta}})\big|^{-1/2}\, f(\mathbf{Y} \mid \tilde{\boldsymbol{\theta}})\,\pi(\tilde{\boldsymbol{\theta}}). \tag{2}$$
The problem we face when using the Laplace approximation here is that we need to find the mode of $\pi(\boldsymbol{\theta} \mid \mathbf{Y})$, in which $\boldsymbol{\theta}$ is high-dimensional. To address this problem, Besag proposed the iterated conditional modes (ICM) algorithm. ICM obtains a local maximum of the joint posterior by iteratively maximizing the full conditionals as follows:
- Begin with initial values $(\mathbf{A}^{(0)}, \mathbf{B}^{(0)}, \sigma^{2(0)})$, then for $t = 1, 2, \dots$
- Given $(\mathbf{B}^{(t-1)}, \sigma^{2(t-1)})$, set $\mathbf{a}_j^{(t)} = \boldsymbol{\mu}_{a_j}$, the mode (mean) of its full conditional evaluated at the current values, for $j = r+1, \dots, p$;
- Given $(\mathbf{A}^{(t)}, \sigma^{2(t-1)})$, set $\mathbf{b}_l^{(t)} = \boldsymbol{\mu}_{b_l}$ for $l = 1, \dots, q$;
- Given $(\mathbf{A}^{(t)}, \mathbf{B}^{(t)})$, set $\sigma^{2(t)} = \dfrac{b + \|\mathbf{Y} - \mathbf{X}\mathbf{A}^{(t)}\mathbf{B}^{(t)\rm T}\|_F^2}{nq + a + 2}$, the mode of its inverse-gamma full conditional.
We obtain $\tilde{\boldsymbol{\theta}} = (\tilde{\mathbf{A}}, \tilde{\mathbf{B}}, \tilde{\sigma}^2)$ as the estimate of the posterior mode and plug $\tilde{\boldsymbol{\theta}}$ into Eq. (2) to calculate $m_r(\mathbf{Y})$. Approximating $\log\big|-\nabla^2 h(\tilde{\boldsymbol{\theta}})\big| \approx d_r \log n$ and dropping terms that do not grow with $n$ gives $\log m_r(\mathbf{Y}) \approx \log f(\mathbf{Y} \mid \tilde{\boldsymbol{\theta}}) - \frac{d_r}{2}\log n$; hence, the Laplace approximation used here is actually proportional to BIC.
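The appendix code below only implements DIC, so here is a minimal sketch of how the ICM mode search and the BIC-type Laplace approximation above could be coded. The function name `laplace.logml` is my own, and the defaults for `tau2`, `a` and `b` follow the simulation settings:

```r
# Sketch: ICM mode search followed by a BIC-type Laplace approximation of
# the log marginal likelihood for a candidate rank r.
laplace.logml <- function(Y, X, r, tau2 = 1e-3, a = 1, b = 1, maxit = 100) {
  n <- nrow(Y); q <- ncol(Y); p <- ncol(X)
  # least-squares initial values, as in the Gibbs sampler
  C.ls <- coef(lm(Y ~ X - 1))
  B <- t(matrix(C.ls[1:r, ], r, q))
  A <- rbind(diag(1, r),
             if (r < p) C.ls[(r + 1):p, , drop = FALSE] %*% B %*% solve(crossprod(B)))
  sig2 <- sum((Y - X %*% A %*% t(B))^2) / (n * q)
  for (it in 1:maxit) {
    if (r < p) for (j in (r + 1):p) {   # conditional mode of each free row of A
      S <- solve(crossprod(B) * sum(X[, j]^2) / sig2 + diag(tau2, r))
      A[j, ] <- S %*% t(B) %*% t(Y - X[, -j] %*% A[-j, , drop = FALSE] %*% t(B)) %*% X[, j] / sig2
    }
    S.b <- solve(crossprod(X %*% A) / sig2 + diag(tau2, r))
    for (l in 1:q) B[l, ] <- S.b %*% t(X %*% A) %*% Y[, l] / sig2  # conditional mode of b_l
    sig2 <- (b + sum((Y - X %*% A %*% t(B))^2)) / (n * q + a + 2)  # inverse-gamma mode
  }
  d <- r * (p - r) + q * r + 1          # number of free parameters
  loglik <- -0.5 * n * q * log(2 * pi * sig2) - 0.5 * sum((Y - X %*% A %*% t(B))^2) / sig2
  loglik - 0.5 * d * log(n)             # BIC-type approximation of log m_r(Y)
}
```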
DIC
Let $D(\boldsymbol{\theta}) = -2\log f(\mathbf{Y} \mid \boldsymbol{\theta}) = nq\log(2\pi\sigma^2) + \frac{1}{\sigma^2}\|\mathbf{Y} - \mathbf{X}\mathbf{C}\|_F^2$ denote the deviance, with $\boldsymbol{\theta} = (\mathbf{C}, \sigma^2)$.
Then DIC can be computed by
$$\mathrm{DIC} = 2\,\overline{D(\boldsymbol{\theta})} - D(\hat{\boldsymbol{\theta}}) = \frac{2}{T}\sum_{t=1}^{T} D(\boldsymbol{\theta}^{(t)}) - D(\hat{\boldsymbol{\theta}}),$$
where $\{\boldsymbol{\theta}^{(t)}\}_{t=1}^{T}$ is an MCMC sample generated from the posterior $\pi(\boldsymbol{\theta} \mid \mathbf{Y})$ and $\hat{\boldsymbol{\theta}} = (\hat{\mathbf{C}}, \hat{\sigma}^2)$ is the posterior mean.
Note that $\mathrm{DIC} = D(\hat{\boldsymbol{\theta}}) + 2p_D$ with effective number of parameters $p_D = \overline{D(\boldsymbol{\theta})} - D(\hat{\boldsymbol{\theta}})$, which is analogous to AIC.
Gelfand & Dey (GD)
Let $g(\boldsymbol{\theta})$ be any density whose support is contained in that of the posterior, so that $E\left[\frac{g(\boldsymbol{\theta})}{f(\mathbf{Y} \mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta})}\,\middle|\,\mathbf{Y}\right] = \frac{1}{m_r(\mathbf{Y})}$; then the GD estimator is
$$\hat{m}_r(\mathbf{Y}) = \left\{\frac{1}{T}\sum_{t=1}^{T}\frac{g(\boldsymbol{\theta}^{(t)})}{f(\mathbf{Y} \mid \boldsymbol{\theta}^{(t)})\,\pi(\boldsymbol{\theta}^{(t)})}\right\}^{-1}, \tag{3}$$
where $\{\boldsymbol{\theta}^{(t)}\}_{t=1}^{T}$ is an MCMC sample generated from the posterior $\pi(\boldsymbol{\theta} \mid \mathbf{Y})$. Define $g(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}; \hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})$, where $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ are the MCMC sample mean and variance of $\boldsymbol{\theta}$, respectively. I use the formula as follows:
$$g(\boldsymbol{\theta}) = (2\pi)^{-d_r/2}\,|\hat{\boldsymbol{\Sigma}}|^{-1/2}\exp\left\{-\tfrac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\mu}})^{\rm T}\hat{\boldsymbol{\Sigma}}^{-1}(\boldsymbol{\theta} - \hat{\boldsymbol{\mu}})\right\},$$
where $d_r = r(p - r) + qr + 1$ is the dimension of $\boldsymbol{\theta}$.
Note that when calculating $\hat{m}_r(\mathbf{Y})$, there is a computational problem in formula (3): the ratio $g(\boldsymbol{\theta}^{(t)})/\{f(\mathbf{Y} \mid \boldsymbol{\theta}^{(t)})\,\pi(\boldsymbol{\theta}^{(t)})\}$ overflows to infinity, due to the very large magnitude of the log likelihood and log prior. To solve this problem, I work on the log scale (the log-sum-exp trick) to calculate $\log \hat{m}_r(\mathbf{Y})$.
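As a sketch of this log-scale computation (my own code, not from the paper), assume `log.g` and `log.post.kernel` are vectors holding $\log g(\boldsymbol{\theta}^{(t)})$ and $\log\{f(\mathbf{Y} \mid \boldsymbol{\theta}^{(t)})\,\pi(\boldsymbol{\theta}^{(t)})\}$ evaluated at the retained MCMC draws:

```r
# Sketch: Gelfand-Dey estimator of the log marginal likelihood, computed on
# the log scale to avoid overflow; log.g and log.post.kernel are length-T
# vectors evaluated at the MCMC draws (hypothetical inputs).
gd.logml <- function(log.g, log.post.kernel) {
  log.ratio <- log.g - log.post.kernel   # log of g / (likelihood * prior)
  m <- max(log.ratio)                    # log-sum-exp trick
  log.mean.ratio <- m + log(mean(exp(log.ratio - m)))
  -log.mean.ratio                        # log m_r(Y) = -log of the averaged ratio
}
```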
Simulation Study
Data Generation
In the simulation study, my goal is to recover the true model, i.e., the true rank of the coefficient matrix $\mathbf{C}$, based on the Laplace, DIC and GD methods. I set $n = 100$ and $\sigma^2 = 2$; the dimensions $p = 7$ and $q = 12$ follow from the coefficient matrix, which is as follows:
$$\mathbf{C} = \begin{pmatrix} \mathbf{c}_1^{\rm T} \\ \mathbf{c}_2^{\rm T} \\ \mathbf{c}_3^{\rm T} \\ \mathbf{c}_1^{\rm T} \\ \mathbf{c}_2^{\rm T} \\ \mathbf{c}_2^{\rm T} \\ \mathbf{c}_3^{\rm T} \end{pmatrix}, \qquad \begin{aligned} \mathbf{c}_1^{\rm T} &= (1, 0, 0, 2, -1, 0, 0, 0, 0, 0, 1, -1), \\ \mathbf{c}_2^{\rm T} &= (0, 1, 0, 0, 0, -3, 2, 0, 0, 0, -1, 3), \\ \mathbf{c}_3^{\rm T} &= (0, 0, 1, 0, 0, 0, 0, 3, -3, 4, 2, 2). \end{aligned}$$
Hence, the true rank of $\mathbf{C}$ is 3.
Data are generated from the model in Eq. (1). For the prior, I set $\tau^2 = 10^{-3}$ and $\sigma^2 \sim \mathcal{IG}(a/2, b/2)$, where $a = 1$ and $b = 1$.
After running the MCMC simulation, the result based on 1 replication is shown in Table 1.
Method ($r$) | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Laplace | -3007 | -2535 | -2223 | -2257 | -2285 | -2310 | -2329 |
DIC | 84300 | 17779 | 7302 | 7317 | 7325 | 7363 | 7405 |
GD | -3062 | -2631 | -2347 | -2397 | -2436 | -2470 | -2495 |
MSE | 1.032 | 0.186 | 0.0120 | 0.014 | 0.0170 | 0.0192 | 0.0217 |
The MSE in the table is defined as follows:
$$\mathrm{MSE} = \frac{1}{pq}\|\hat{\mathbf{C}} - \mathbf{C}\|_F^2.$$
MSE indicates the goodness of fit of the reduced rank regression model. We can observe that the MSE is minimized when the rank is 3, which demonstrates that the parameter estimation is good. In this case, the Laplace, DIC and GD methods all select the true rank. Given rank $r = 3$, the estimated $\hat{\mathbf{C}}$ is as follows:
$$\hat{\mathbf{C}} = \begin{pmatrix}
1.07 & 0.17 & 0.01 & 2.34 & -0.89 & -0.12 & -0.11 & -0.08 & -0.07 & 0.03 & 1.26 & -1.10 \\
-0.02 & 1.00 & 0.04 & -0.14 & -0.17 & -2.83 & 2.10 & 0.10 & 0.03 & 0.03 & -1.04 & 3.17 \\
0.13 & -0.03 & 0.99 & 0.18 & -0.02 & 0.08 & 0.00 & 3.06 & -2.85 & 4.03 & 2.10 & 2.06 \\
0.94 & 0.14 & 0.03 & 2.05 & -0.78 & -0.08 & -0.11 & -0.03 & -0.10 & 0.07 & 1.13 & -0.95 \\
0.04 & 0.99 & 0.04 & -0.00 & -0.21 & -2.79 & 2.05 & 0.10 & 0.01 & 0.04 & -0.95 & 3.05 \\
0.12 & 1.05 & 0.00 & 0.18 & -0.29 & -2.93 & 2.14 & -0.02 & 0.12 & -0.12 & -0.98 & 3.03 \\
0.08 & -0.01 & 0.98 & 0.08 & 0.01 & 0.03 & 0.04 & 3.03 & -2.81 & 3.98 & 2.01 & 2.14
\end{pmatrix}$$
In addition, when the selected rank is larger than 3, the MSE is only slightly worse. In other words, even with a larger rank in $\mathbf{C} = \mathbf{A}\mathbf{B}^{\rm T}$, the overfitted models still estimate $\mathbf{C}$ well, and the values of DIC, Laplace and GD are not far from their values at the true rank. This is reasonable for overfitted models. But when the selected rank is smaller than the true rank, the MSE increases significantly, and the corresponding values of DIC, Laplace and GD are far from their values at the true rank. This means the underfitted model misses important components of $\mathbf{C}$. This analysis is based on one replication.
In Table 2, the result is based on 1000 replications, in order to examine the performance of the methods for model selection. Laplace and GD always select the true rank, whereas DIC tends to select a larger model: the average selected rank for DIC is about 3.5, while that of Laplace and GD is 3. In terms of the probability of selecting the true rank, Laplace and GD reach 100%, but DIC only 61.7%. This makes sense because Laplace gives an approximation of the marginal likelihood and GD gives a simulation-consistent estimate of it, so both target the model posterior probability directly and work well for model selection. DIC is analogous to AIC, which tends to select larger models and is better suited for short-term prediction.
Method | Mean of selected rank | Probability of selecting true rank |
---|---|---|
Laplace | 3 | 1 |
DIC | 3.525 | 0.617 |
GD | 3 | 1 |
R Code for Bayesian reduced rank regression with DIC used in the simulation study
library(mvtnorm)
library(invgamma)
rm(list=ls())
## Data & Model##
row1 <- c(1,0,0,2,-1,0,0,0,0,0,1,-1)
row2 <- c(0,1,0,0,0,-3,2,0,0,0,-1,3)
row3 <- c(0,0,1,0,0,0,0,3,-3,4,2,2)
C <- rbind(row1,row2,row3,row1,row2,row2,row3)
true.rank <- 3
p <- dim(C)[1]
q <- dim(C)[2]
true.sig2 <- 2
true.SIGe <- diag(true.sig2,q)
n=100
a=b=1
tau2=1e-3
MC.size=5000
burn_in=3000
reptn <- 125
error <- array(NA,dim = c(reptn,p))
DIC <- array(NA,dim = c(reptn,p))
TPR.DIC <- rep(NA,reptn)
TPR.error <- rep(NA,reptn)
hat.rank <- rep(NA,reptn)
v=0 # chunk index used in the random seed and the output file name
for (i in 1:reptn) {
t1=Sys.time()
set.seed(2144+i+125*v)
X <- matrix(rnorm(n*p,0,1),n,p)
E <- rmvnorm(n, rep(0,q), true.SIGe, method="chol")
Y=X%*%C+E
## Reduced Rank Regression model ##
for (r in 1:p) {
Hat.A <- array(NA,dim = c(p,r,MC.size))
Hat.B <- array(NA,dim = c(q,r,MC.size))
Hat.C <- array(NA,dim = c(p,q,MC.size))
Hat.sig2 <- rep(NA,MC.size)
# initial values
hat.C <- coef(lm(Y~X-1))
if (r==1){
hat.B <- as.matrix(hat.C[1:r,])
} else {
hat.B <- t(hat.C[1:r,])
}
if (r==p) {
hat.A <- diag(1,r)
} else {
hat.A <- rbind(diag(1,r),hat.C[(r+1):p,]%*%hat.B%*%solve(crossprod(hat.B)))
}
hat.sig2 <- mean(diag(tcrossprod(Y-X%*%hat.A%*%t(hat.B)))) #true.sig2
## MCMC ##
for (jin in 1:MC.size) {
# Sampling A
if (r==p){
Hat.A[,,jin] <- hat.A
} else {
for(j in (r+1):p) {
Sig.A_j <- solve(crossprod(hat.B)*as.numeric(crossprod(X[,j]))/hat.sig2+diag(tau2,r))
mu.A_j <- Sig.A_j%*%t(hat.B)%*%t(Y-X[,-j]%*%hat.A[-j,]%*%t(hat.B))%*%X[,j]/hat.sig2
hat.a_j <- rmvnorm(n=1,mean = mu.A_j,sigma = Sig.A_j)
hat.A[j,] <- hat.a_j
}
Hat.A[,,jin] <- hat.A
}
# Sampling B
Sig.B_l <- solve(crossprod(X%*%hat.A)/hat.sig2+diag(tau2,r))
for(l in 1:q) {
mu.B_l <- Sig.B_l%*%t(X%*%hat.A)%*%Y[,l]/hat.sig2
hat.b_l <- t(rmvnorm(n=1,mean = mu.B_l,sigma = Sig.B_l))
hat.B[l,] <- t(hat.b_l)
}
Hat.B[,,jin] <- hat.B
Hat.C[,,jin] <- hat.A%*%t(hat.B)
#Sampling Sig2
hat.sig2 <- rinvgamma(n = 1,shape = (q*n+a)/2,rate = 0.5*(b+sum(diag(tcrossprod(Y-X%*%hat.A%*%t(hat.B))))))
Hat.sig2[jin] <- hat.sig2
#print(jin)
}
hat.C <- apply(Hat.C[,,-(1:burn_in)],c(1,2),mean)
hat.sig2 <- mean(Hat.sig2[-(1:burn_in)])
# print(C)
D.theta <- rep(NA,MC.size-burn_in)
j=0
for (jin in (burn_in+1):MC.size) {
j=j+1
D.theta[j] <- n*q*log(2*pi*Hat.sig2[jin])+sum(diag(crossprod(Y-X%*%Hat.C[,,jin])))/Hat.sig2[jin] # deviance D(theta) = -2*log-likelihood
}
D.hat.theta <- n*q*log(2*pi*hat.sig2)+sum(diag(crossprod(Y-X%*%hat.C)))/hat.sig2 # deviance at the posterior mean
DIC[i,r] <- 2*mean(D.theta)- D.hat.theta
error[i,r] <- sum((hat.C-C)^2)/(p*q)
}
hat.rank[i] <- which.min(DIC[i,])
TPR.DIC[i] <- hat.rank[i]==true.rank
TPR.error[i] <- which.min(error[i,])==true.rank
print(c(i,hat.rank[i]))
print(Sys.time()-t1)
}
data.frame(hat.rank=mean(hat.rank),TPR.DIC=mean(TPR.DIC),TPR.ERR=mean(TPR.error))
save(list = ls(), file = paste0('DIC-',v,'.RData'))
Reference
Geweke, John F. 1996. “Bayesian Reduced Rank Regression in Econometrics.” Journal of Econometrics 75 (November). https://doi.org/10.1016/0304-4076(95)01773-9.