
Bayesian Reduced Rank Regression in Econometrics

Caleb Jin / 2018-05-09


Foreword

After reading the paper Bayesian Reduced Rank Regression in Econometrics (Geweke 1996), I write down notes on its key ideas and R code to implement it.

Introduction

In contemporary society, large amounts of data are generated and collected easily and routinely in various academic and industrial areas, such as engineering, politics, B2C e-commerce, and genomics. Many problems can be cast as statistical problems under the framework of the multivariate linear regression model, in which both the response variables and the predictors are high-dimensional.

We assume $n$ independent observations of the response $\mathbf{y}_i\in\mathbb{R}^q$ with predictor vector $\mathbf{x}_i\in\mathbb{R}^p$, $i=1,2,\dots,n$. Consider the multivariate linear regression model
$$\mathbf{Y}=\mathbf{X}\mathbf{C}+\mathbf{E},\tag{1}$$
where $\mathbf{Y}=(\mathbf{y}_1,\mathbf{y}_2,\dots,\mathbf{y}_n)^{\top}\in\mathbb{R}^{n\times q}$ is the response matrix, $\mathbf{X}=(\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_n)^{\top}\in\mathbb{R}^{n\times p}$ is the design matrix, $\mathbf{C}\in\mathbb{R}^{p\times q}$ is the coefficient matrix, and $\mathbf{E}=(\mathbf{e}_1,\mathbf{e}_2,\dots,\mathbf{e}_n)^{\top}\in\mathbb{R}^{n\times q}$ is the disturbance matrix with $\mathbf{e}_i\overset{iid}{\sim}\mathcal{N}_q(\mathbf{0},\boldsymbol{\Sigma}_e)$. We assume $\boldsymbol{\Sigma}_e=\sigma^2\mathbf{I}_q$. Therefore, we have $\mathbf{Y}\sim\mathcal{MN}(\mathbf{X}\mathbf{C},\boldsymbol{\Sigma}_e,\mathbf{I}_n)$.

Bayesian Reduced Rank Regression Model

We consider decomposing the coefficient matrix into two parts as follows: $\mathbf{C}=\mathbf{A}\mathbf{B}$, where $\mathbf{A}\in\mathbb{R}^{p\times r}$, $\mathbf{B}\in\mathbb{R}^{r\times q}$, and the rank $r\leq\min(p,q)$ is known. However, this decomposition is not unique: for any $r\times r$ nonsingular matrix $\mathbf{Q}$, $\mathbf{C}=\mathbf{A}\mathbf{B}=\mathbf{A}\mathbf{Q}\mathbf{Q}^{-1}\mathbf{B}=\tilde{\mathbf{A}}\tilde{\mathbf{B}}$. To identify the decomposition, we further restrict $\mathbf{A}$ so that its first $r$ rows form the identity, $\mathbf{A}=[\mathbf{I}_r,\mathbf{A}_{*}^{\top}]^{\top}$. The author assumes the priors $p(\mathbf{A},\mathbf{B})\propto\exp\left(-\frac{\tau^2}{2}\mathrm{trace}\{\mathbf{A}^{\top}\mathbf{A}+\mathbf{B}\mathbf{B}^{\top}\}\right)$ and $\sigma^2\sim\mathcal{IG}\left(\frac{a}{2},\frac{b}{2}\right)$.
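
The following small R sketch illustrates the non-uniqueness and the identification numerically; the dimensions chosen below are arbitrary and only for illustration.

# Non-uniqueness of C = AB: multiplying A by a nonsingular Q and B by Q^{-1}
# leaves C unchanged. Fixing the first r rows of A to I_r removes this ambiguity.
set.seed(1)
p <- 5; q <- 4; r <- 2
A <- rbind(diag(r), matrix(rnorm((p - r) * r), p - r, r))  # A = [I_r; A_*]
B <- matrix(rnorm(r * q), r, q)
C <- A %*% B
Q <- matrix(rnorm(r * r), r, r)          # an (almost surely) nonsingular r x r matrix
A.tilde <- A %*% Q
B.tilde <- solve(Q) %*% B
max(abs(C - A.tilde %*% B.tilde))        # numerically zero: same C, different factors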

Posterior Distribution

Let $\tilde{\mathbf{a}}_k$ and $\tilde{\mathbf{b}}_k$ denote the $k$th columns of $\mathbf{A}$ and $\mathbf{B}$, respectively. Let $\mathbf{a}_j\in\mathbb{R}^r$ denote the $j$th row of $\mathbf{A}$ (written as a column vector) and $\mathbf{b}_l\in\mathbb{R}^r$ the $l$th column of $\mathbf{B}$; likewise, $\tilde{\mathbf{x}}_j$ and $\tilde{\mathbf{y}}_l$ denote the $j$th column of $\mathbf{X}$ and the $l$th column of $\mathbf{Y}$. Finally, $\mathbf{X}_{(-j)}$ and $\mathbf{A}_{(-j)}$ denote $\mathbf{X}$ without its $j$th column and $\mathbf{A}$ without its $j$th row, and $\mathbf{Y}_{(-l)}$ and $\mathbf{B}_{(-l)}$ denote $\mathbf{Y}$ and $\mathbf{B}$ without their $l$th columns.

$$\begin{aligned}
p(\mathbf{Y}|\mathbf{A},\mathbf{B},\boldsymbol{\Sigma}_e)&\propto|\boldsymbol{\Sigma}_e|^{-\frac{n}{2}}\exp\left(-\tfrac{1}{2}\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})\boldsymbol{\Sigma}_e^{-1}(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})^{\top}\}\right)\\
&=(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\tfrac{1}{2\sigma^2}\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})^{\top}(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})\}\right)\\
&=(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\tfrac{1}{2\sigma^2}\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}_{(-j)}\mathbf{A}_{(-j)}\mathbf{B})^{\top}(\mathbf{Y}-\mathbf{X}_{(-j)}\mathbf{A}_{(-j)}\mathbf{B})\}\right)\\
&\quad\times\exp\left(-\tfrac{1}{2\sigma^2}\mathrm{trace}\{\tilde{\mathbf{x}}_j\mathbf{a}_j^{\top}\mathbf{B}\mathbf{B}^{\top}\mathbf{a}_j\tilde{\mathbf{x}}_j^{\top}\}\right)\times\exp\left(\tfrac{1}{\sigma^2}\mathrm{trace}\{\tilde{\mathbf{x}}_j\mathbf{a}_j^{\top}\mathbf{B}(\mathbf{Y}-\mathbf{X}_{(-j)}\mathbf{A}_{(-j)}\mathbf{B})^{\top}\}\right)\\
&=(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\tfrac{1}{2\sigma^2}\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}_{(-j)}\mathbf{A}_{(-j)}\mathbf{B})^{\top}(\mathbf{Y}-\mathbf{X}_{(-j)}\mathbf{A}_{(-j)}\mathbf{B})\}\right)\\
&\quad\times\exp\left(-\tfrac{1}{2\sigma^2}\mathbf{a}_j^{\top}\mathbf{B}\mathbf{B}^{\top}\mathbf{a}_j\,\tilde{\mathbf{x}}_j^{\top}\tilde{\mathbf{x}}_j\right)\times\exp\left(2\cdot\tfrac{1}{2\sigma^2}\mathbf{a}_j^{\top}\mathbf{B}(\mathbf{Y}-\mathbf{X}_{(-j)}\mathbf{A}_{(-j)}\mathbf{B})^{\top}\tilde{\mathbf{x}}_j\right).
\end{aligned}$$

$$\begin{aligned}
p(\mathbf{a}_j|\mathbf{A}_{(-j)},\mathbf{B},\boldsymbol{\Sigma}_e,\mathbf{Y})&\propto p(\mathbf{A}|\mathbf{B},\boldsymbol{\Sigma}_e,\mathbf{Y})\propto p(\mathbf{Y}|\mathbf{A},\mathbf{B},\boldsymbol{\Sigma}_e)\,p(\mathbf{A},\mathbf{B})\\
&\propto p(\mathbf{Y}|\mathbf{A},\mathbf{B},\boldsymbol{\Sigma}_e)\exp\left(-\tfrac{\tau^2}{2}\mathrm{trace}\{\mathbf{A}^{\top}\mathbf{A}+\mathbf{B}\mathbf{B}^{\top}\}\right)\propto p(\mathbf{Y}|\mathbf{A},\mathbf{B},\boldsymbol{\Sigma}_e)\exp\left(-\tfrac{\tau^2}{2}\sum_{j=1}^{p}\mathbf{a}_j^{\top}\mathbf{a}_j\right)\\
&\propto\exp\left(-\tfrac{1}{2\sigma^2}\mathbf{a}_j^{\top}\mathbf{B}\mathbf{B}^{\top}\mathbf{a}_j\,\tilde{\mathbf{x}}_j^{\top}\tilde{\mathbf{x}}_j\right)\exp\left(\tfrac{1}{\sigma^2}\mathbf{a}_j^{\top}\mathbf{B}(\mathbf{Y}-\mathbf{X}_{(-j)}\mathbf{A}_{(-j)}\mathbf{B})^{\top}\tilde{\mathbf{x}}_j\right)\exp\left(-\tfrac{\tau^2}{2}\mathbf{a}_j^{\top}\mathbf{a}_j\right)\\
&=\exp\left\{-\tfrac{1}{2}\left(\mathbf{a}_j^{\top}\left(\tfrac{\mathbf{B}\mathbf{B}^{\top}\tilde{\mathbf{x}}_j^{\top}\tilde{\mathbf{x}}_j}{\sigma^2}+\tau^2\mathbf{I}_r\right)\mathbf{a}_j-\tfrac{2\mathbf{a}_j^{\top}\mathbf{B}(\mathbf{Y}-\mathbf{X}_{(-j)}\mathbf{A}_{(-j)}\mathbf{B})^{\top}\tilde{\mathbf{x}}_j}{\sigma^2}\right)\right\}\\
&=\exp\left\{-\tfrac{1}{2}\left(\mathbf{a}_j^{\top}\boldsymbol{\Sigma}_{jA}^{-1}\mathbf{a}_j-2\mathbf{a}_j^{\top}\boldsymbol{\Sigma}_{jA}^{-1}\boldsymbol{\mu}_{jA}\right)\right\}\\
&\propto\exp\left\{-\tfrac{1}{2}(\mathbf{a}_j-\boldsymbol{\mu}_{jA})^{\top}\boldsymbol{\Sigma}_{jA}^{-1}(\mathbf{a}_j-\boldsymbol{\mu}_{jA})\right\},
\end{aligned}$$
where $\boldsymbol{\Sigma}_{jA}=\left(\frac{\mathbf{B}\mathbf{B}^{\top}\tilde{\mathbf{x}}_j^{\top}\tilde{\mathbf{x}}_j}{\sigma^2}+\tau^2\mathbf{I}_r\right)^{-1}$ and $\boldsymbol{\mu}_{jA}=\boldsymbol{\Sigma}_{jA}\frac{\mathbf{B}(\mathbf{Y}-\mathbf{X}_{(-j)}\mathbf{A}_{(-j)}\mathbf{B})^{\top}\tilde{\mathbf{x}}_j}{\sigma^2}$.

Hence, $\mathbf{a}_j|\mathbf{A}_{(-j)},\mathbf{B},\boldsymbol{\Sigma}_e,\mathbf{Y}\sim\mathcal{N}_r(\boldsymbol{\mu}_{jA},\boldsymbol{\Sigma}_{jA})$. Similarly, isolating the $l$th column of $\mathbf{Y}$ and $\mathbf{B}$,
$$\begin{aligned}
p(\mathbf{Y}|\mathbf{A},\mathbf{B},\boldsymbol{\Sigma}_e)&\propto|\boldsymbol{\Sigma}_e|^{-\frac{n}{2}}\exp\left(-\tfrac{1}{2}\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})\boldsymbol{\Sigma}_e^{-1}(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})^{\top}\}\right)\\
&=(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\tfrac{1}{2\sigma^2}\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})^{\top}(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})\}\right)\\
&=(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\tfrac{1}{2\sigma^2}\mathrm{trace}\{(\mathbf{Y}_{(-l)}-\mathbf{X}\mathbf{A}\mathbf{B}_{(-l)})^{\top}(\mathbf{Y}_{(-l)}-\mathbf{X}\mathbf{A}\mathbf{B}_{(-l)})\}\right)\\
&\quad\times\exp\left(-\tfrac{1}{2\sigma^2}\left[\mathbf{b}_l^{\top}(\mathbf{X}\mathbf{A})^{\top}\mathbf{X}\mathbf{A}\mathbf{b}_l-2\mathbf{b}_l^{\top}(\mathbf{X}\mathbf{A})^{\top}\tilde{\mathbf{y}}_l+\tilde{\mathbf{y}}_l^{\top}\tilde{\mathbf{y}}_l\right]\right).
\end{aligned}$$

$$\begin{aligned}
p(\mathbf{b}_l|\mathbf{A},\mathbf{B}_{(-l)},\boldsymbol{\Sigma}_e,\mathbf{Y})&\propto p(\mathbf{B}|\mathbf{A},\boldsymbol{\Sigma}_e,\mathbf{Y})\propto p(\mathbf{Y}|\mathbf{A},\mathbf{B},\boldsymbol{\Sigma}_e)\,p(\mathbf{A},\mathbf{B})\\
&\propto p(\mathbf{Y}|\mathbf{A},\mathbf{B},\boldsymbol{\Sigma}_e)\exp\left(-\tfrac{\tau^2}{2}\mathrm{trace}\{\mathbf{A}^{\top}\mathbf{A}+\mathbf{B}\mathbf{B}^{\top}\}\right)\propto p(\mathbf{Y}|\mathbf{A},\mathbf{B},\boldsymbol{\Sigma}_e)\exp\left(-\tfrac{\tau^2}{2}\sum_{l=1}^{q}\mathbf{b}_l^{\top}\mathbf{b}_l\right)\\
&\propto\exp\left(-\tfrac{1}{2\sigma^2}\left[\mathbf{b}_l^{\top}(\mathbf{X}\mathbf{A})^{\top}\mathbf{X}\mathbf{A}\mathbf{b}_l-2\mathbf{b}_l^{\top}(\mathbf{X}\mathbf{A})^{\top}\tilde{\mathbf{y}}_l\right]\right)\exp\left(-\tfrac{\tau^2}{2}\mathbf{b}_l^{\top}\mathbf{b}_l\right)\\
&=\exp\left\{-\tfrac{1}{2}\left(\mathbf{b}_l^{\top}\left(\tfrac{(\mathbf{X}\mathbf{A})^{\top}\mathbf{X}\mathbf{A}}{\sigma^2}+\tau^2\mathbf{I}_r\right)\mathbf{b}_l-\tfrac{2\mathbf{b}_l^{\top}(\mathbf{X}\mathbf{A})^{\top}\tilde{\mathbf{y}}_l}{\sigma^2}\right)\right\}\\
&=\exp\left\{-\tfrac{1}{2}\left(\mathbf{b}_l^{\top}\boldsymbol{\Sigma}_{lB}^{-1}\mathbf{b}_l-2\mathbf{b}_l^{\top}\boldsymbol{\Sigma}_{lB}^{-1}\boldsymbol{\mu}_{lB}\right)\right\}\\
&\propto\exp\left\{-\tfrac{1}{2}(\mathbf{b}_l-\boldsymbol{\mu}_{lB})^{\top}\boldsymbol{\Sigma}_{lB}^{-1}(\mathbf{b}_l-\boldsymbol{\mu}_{lB})\right\},
\end{aligned}$$
where $\boldsymbol{\Sigma}_{lB}=\left(\frac{(\mathbf{X}\mathbf{A})^{\top}\mathbf{X}\mathbf{A}}{\sigma^2}+\tau^2\mathbf{I}_r\right)^{-1}$ and $\boldsymbol{\mu}_{lB}=\boldsymbol{\Sigma}_{lB}\frac{(\mathbf{X}\mathbf{A})^{\top}\tilde{\mathbf{y}}_l}{\sigma^2}$. (Note that $\boldsymbol{\Sigma}_{lB}$ does not actually depend on $l$.)

Hence, $\mathbf{b}_l|\mathbf{A},\mathbf{B}_{(-l)},\boldsymbol{\Sigma}_e,\mathbf{Y}\sim\mathcal{N}_r(\boldsymbol{\mu}_{lB},\boldsymbol{\Sigma}_{lB})$.

Since $\boldsymbol{\Sigma}_e=\sigma^2\mathbf{I}_q$, the element in the $k$th row and $k$th column of $\boldsymbol{\Sigma}_e$ is $\sigma^2$, $k=1,2,\dots,q$. Then
$$\begin{aligned}
p(\sigma^2|\mathbf{A},\mathbf{B},\mathbf{Y})&\propto p(\mathbf{Y}|\mathbf{A},\mathbf{B},\boldsymbol{\Sigma}_e)\,p(\sigma^2)\\
&=(\sigma^2)^{-\frac{nq}{2}}\exp\left(-\tfrac{1}{2\sigma^2}\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})^{\top}(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})\}\right)\times(\sigma^2)^{-\frac{a}{2}-1}\exp\left(-\tfrac{b}{2\sigma^2}\right)\\
&=(\sigma^2)^{-\frac{nq+a}{2}-1}\exp\left(-\tfrac{1}{2\sigma^2}\left(\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})^{\top}(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})\}+b\right)\right).
\end{aligned}$$
Hence, $\sigma^2|\mathbf{A},\mathbf{B},\mathbf{Y}\sim\mathcal{IG}\left(\frac{nq+a}{2},\frac{1}{2}\left(\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})^{\top}(\mathbf{Y}-\mathbf{X}\mathbf{A}\mathbf{B})\}+b\right)\right)$.

Gibbs Sampling

The algorithm is easy to construct. Begin with initial values $\mathbf{A}^{(0)},\mathbf{B}^{(0)},\sigma^{2(0)}$; then for $t=1,2,\dots$

  • Step (1) Given $\mathbf{A}_{(-j)}^{(t-1)},\mathbf{B}^{(t-1)},\sigma^{2(t-1)}$, draw $\mathbf{a}_j^{(t)}$ from $\mathcal{N}_r(\boldsymbol{\mu}_{jA},\boldsymbol{\Sigma}_{jA})$, $j=r+1,r+2,\dots,p$;
  • Step (2) Given $\mathbf{A}^{(t)},\mathbf{B}_{(-l)}^{(t-1)},\sigma^{2(t-1)}$, draw $\mathbf{b}_l^{(t)}$ from $\mathcal{N}_r(\boldsymbol{\mu}_{lB},\boldsymbol{\Sigma}_{lB})$, $l=1,2,\dots,q$;
  • Step (3) Given $\mathbf{A}^{(t)},\mathbf{B}^{(t)}$, draw $\sigma^{2(t)}$ from its inverse-gamma full conditional $\mathcal{IG}\left(\frac{nq+a}{2},\frac{1}{2}\left(\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}\mathbf{A}^{(t)}\mathbf{B}^{(t)})^{\top}(\mathbf{Y}-\mathbf{X}\mathbf{A}^{(t)}\mathbf{B}^{(t)})\}+b\right)\right).
  • Step (4) Repeat steps (1) to (3) until convergence.

Note that step (1) samples $\mathbf{a}_j$ only for $j=r+1,\dots,p$, because the first $r$ rows of $\mathbf{A}$ form $\mathbf{I}_r$, which is known. Hence, step (1) samples $\mathbf{A}_{*}$ and then stacks $\mathbf{I}_r$ on top of it to reconstruct $\mathbf{A}$.
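
To make one iteration concrete, here is a condensed sketch of a single Gibbs sweep under the notation above. The function name gibbs.sweep and its argument names are mine; as in the full simulation script at the end of the post, $\mathbf{B}$ is stored as a $q\times r$ matrix so that $\mathbf{C}=\mathbf{A}\mathbf{B}^{\top}$ in code.

# One sweep of the Gibbs sampler (steps (1)-(3)): A is p x r with its first r
# rows fixed at I_r, and B is stored q x r so that C = A %*% t(B).
library(mvtnorm)
library(invgamma)

gibbs.sweep <- function(Y, X, A, B, sig2, tau2, a, b) {
  n <- nrow(Y); q <- ncol(Y); p <- ncol(X); r <- ncol(A)
  # Step (1): update rows a_j of A for j = r+1, ..., p (the first r rows stay I_r)
  if (r < p) {
    for (j in (r + 1):p) {
      Sig.A <- solve(crossprod(B) * sum(X[, j]^2) / sig2 + diag(tau2, r))
      mu.A  <- Sig.A %*% t(B) %*% t(Y - X[, -j] %*% A[-j, ] %*% t(B)) %*% X[, j] / sig2
      A[j, ] <- rmvnorm(1, mean = mu.A, sigma = Sig.A)
    }
  }
  # Step (2): update b_l, stored here as the l-th row of the q x r matrix B
  Sig.B <- solve(crossprod(X %*% A) / sig2 + diag(tau2, r))
  for (l in 1:q) {
    mu.B <- Sig.B %*% t(X %*% A) %*% Y[, l] / sig2
    B[l, ] <- rmvnorm(1, mean = mu.B, sigma = Sig.B)
  }
  # Step (3): update sigma^2 from its inverse-gamma full conditional
  SSR <- sum((Y - X %*% A %*% t(B))^2)
  sig2 <- rinvgamma(1, shape = (n * q + a) / 2, rate = (SSR + b) / 2)
  list(A = A, B = B, sig2 = sig2)
}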

Model Selection

To this point we have proceeded as if $r$ were known. In most applications this will not be true, and so the analysis to this point is conditional on $r$. When $r$ is unknown, the analysis may be carried out for several alternative values of $r$. Under the Bayesian framework, model selection can be implemented through the posterior probability of each model given the data,
$$p(M_r|\mathbf{y})=\frac{p(\mathbf{y}|M_r)p(M_r)}{\sum_{r\in\mathcal{M}}p(\mathbf{y}|M_r)p(M_r)}=\frac{p(\mathbf{y}|M_r)}{\sum_{r\in\mathcal{M}}p(\mathbf{y}|M_r)},$$
where $\mathcal{M}=\{1,2,\dots,\min(p,q)\}$ and $p(M_r)=\frac{1}{\mathrm{card}(\mathcal{M})}$.

Hence, since the prior over ranks (models) is flat, $p(M_r|\mathbf{y})$ is determined only by the marginal likelihood $p(\mathbf{y}|M_r)$. The model selection problem is then converted into finding the rank that maximizes the marginal likelihood $p(\mathbf{y}|M_r)$. To calculate $p(\mathbf{y}|M_r)$, I use the Laplace and Gelfand and Dey (GD) methods, and then compare them with DIC.
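
Given (approximate) log marginal likelihoods for each candidate rank, the posterior model probabilities above follow directly. A minimal R sketch (the helper name post.model.prob is mine), which subtracts the maximum before exponentiating to avoid numerical underflow:

# Convert log marginal likelihoods log p(y | M_r), r = 1, ..., min(p, q), into
# posterior model probabilities under the flat prior p(M_r) = 1/card(M).
post.model.prob <- function(log.marg) {
  w <- exp(log.marg - max(log.marg))    # subtract the maximum for numerical stability
  w / sum(w)
}

# For example, with the Laplace column of Table 1 (log scale):
log.marg <- c(-3007, -2535, -2223, -2257, -2285, -2310, -2329)
round(post.model.prob(log.marg), 4)     # essentially all posterior mass on rank 3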

Laplace

Let $\boldsymbol{\theta}=(\mathbf{A},\mathbf{B},\sigma^2)$. The Laplace approximation of the marginal likelihood is
$$p(\mathbf{Y}|M_r)=\int p(\mathbf{Y}|\boldsymbol{\theta},M_r)\,p(\boldsymbol{\theta}|M_r)\,d\mathbf{A}\,d\mathbf{B}\,d\sigma^2\approx p(\mathbf{Y}|\hat{\boldsymbol{\theta}},M_r)\,p(\hat{\boldsymbol{\theta}}|M_r)\,\left|(nq)^{-1}\hat{\boldsymbol{\Sigma}}_{M}\right|^{1/2}(2\pi)^{k_M/2},$$
where $(\hat{\mathbf{A}},\hat{\mathbf{B}},\hat{\sigma}^2)=\arg\max\,p(\mathbf{Y}|\mathbf{A},\mathbf{B},\sigma^2,M_r)\,p(\mathbf{A},\mathbf{B},\sigma^2|M_r)$ is the posterior mode and $k_M=r(p+q-r)+1$ is the number of free parameters under $M_r$.

Hence,
$$\begin{aligned}
\log p(\mathbf{Y}|M_r)&\approx\log p(\mathbf{Y}|\hat{\boldsymbol{\theta}},M_r)+\log p(\hat{\boldsymbol{\theta}}|M_r)-\tfrac{1}{2}k_M\log(nq)+\tfrac{1}{2}\log|\hat{\boldsymbol{\Sigma}}_M|+\tfrac{k_M}{2}\log 2\pi\\
&=-\tfrac{1}{2}\left(-2\log p(\mathbf{Y}|\hat{\mathbf{A}},\hat{\mathbf{B}},\hat{\sigma}^2,M_r)+k_M\log(nq)\right)+C\\
&=-\tfrac{1}{2}\left(nq\log(2\pi\hat{\sigma}^2)+\tfrac{1}{\hat{\sigma}^2}\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}\hat{\mathbf{C}})^{\top}(\mathbf{Y}-\mathbf{X}\hat{\mathbf{C}})\}+[r(p+q-r)+1]\log(nq)\right)+C\\
&=-\tfrac{1}{2}\mathrm{BIC}+C\qquad (2)\\
&\approx-\tfrac{1}{2}\mathrm{BIC},\quad\text{for large } n.
\end{aligned}$$

The problem we face when using the Laplace approximation here is that we need to find the mode of $p(\mathbf{A},\mathbf{B},\sigma^2|\mathbf{Y})$, in which $(\mathbf{A},\mathbf{B},\sigma^2)$ is high-dimensional. To address this problem, Besag proposed the iterated conditional modes (ICM) algorithm. ICM obtains a local maximum of the joint posterior by iteratively maximizing the full conditionals as follows:

  • Begin with initial values $\mathbf{A}^{(0)},\mathbf{B}^{(0)},\sigma^{2(0)}$; then for $t=1,2,\dots$
  • Given $\mathbf{A}_{(-j)}^{(t-1)},\mathbf{B}^{(t-1)},\sigma^{2(t-1)}$, set $\mathbf{a}_j^{(t)}\leftarrow\boldsymbol{\mu}_{jA}$, $j=r+1,r+2,\dots,p$;
  • Given $\mathbf{A}^{(t)},\mathbf{B}_{(-l)}^{(t-1)},\sigma^{2(t-1)}$, set $\mathbf{b}_l^{(t)}\leftarrow\boldsymbol{\mu}_{lB}$, $l=1,2,\dots,q$;
  • Given $\mathbf{A}^{(t)},\mathbf{B}^{(t)}$, set $\sigma^{2(t)}\leftarrow\frac{b^{*}}{a^{*}-1}$, where $a^{*}=\frac{nq+a}{2}$ and $b^{*}=\frac{1}{2}\left(\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}\mathbf{A}^{(t)}\mathbf{B}^{(t)})^{\top}(\mathbf{Y}-\mathbf{X}\mathbf{A}^{(t)}\mathbf{B}^{(t)})\}+b\right)$ are the parameters of the inverse-gamma full conditional.

We obtain $\mathbf{C}^{(t)}=\mathbf{A}^{(t)}\mathbf{B}^{(t)}$ and estimate $\hat{\mathbf{C}}_{ICM}=T^{-1}\sum_{t=1}^{T}\mathbf{C}^{(t)}$. Plugging $\hat{\mathbf{C}}_{ICM}$ into Eq. (2) gives $-\frac{1}{2}\mathrm{BIC}$. Hence, the Laplace approximation used here is essentially equivalent to BIC.
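
Below is a minimal sketch of these ICM updates followed by the $-\frac{1}{2}\mathrm{BIC}$ evaluation of Eq. (2). It assumes the same data objects Y and X and the $q\times r$ storage of $\mathbf{B}$ used in the script at the end of the post; the function name icm.bic, the fixed number of iterations, and the OLS-based starting values are my own choices, not part of the paper.

# ICM: replace each Gibbs draw by the mean of its full conditional, then evaluate
# -BIC/2 = -0.5 * (-2 * loglik(theta.hat) + k_M * log(n * q)) at C.hat = A %*% t(B).
icm.bic <- function(Y, X, r, tau2 = 1e-3, a = 1, b = 1, n.iter = 200) {
  n <- nrow(Y); q <- ncol(Y); p <- ncol(X)
  # starting values: OLS fit projected onto the rank-r structure
  C.ols <- coef(lm(Y ~ X - 1))
  B <- t(matrix(C.ols[1:r, ], r, q))                      # q x r
  A <- rbind(diag(r), matrix(C.ols[-(1:r), , drop = FALSE] %*% B %*% solve(crossprod(B)),
                             p - r, r))
  sig2 <- mean((Y - X %*% A %*% t(B))^2)
  for (it in 1:n.iter) {
    if (r < p) {
      for (j in (r + 1):p) {                              # a_j <- mu_jA
        Sig.A <- solve(crossprod(B) * sum(X[, j]^2) / sig2 + diag(tau2, r))
        A[j, ] <- Sig.A %*% t(B) %*% t(Y - X[, -j] %*% A[-j, ] %*% t(B)) %*% X[, j] / sig2
      }
    }
    Sig.B <- solve(crossprod(X %*% A) / sig2 + diag(tau2, r))
    for (l in 1:q) B[l, ] <- Sig.B %*% t(X %*% A) %*% Y[, l] / sig2   # b_l <- mu_lB
    SSR <- sum((Y - X %*% A %*% t(B))^2)
    sig2 <- (0.5 * (SSR + b)) / ((n * q + a) / 2 - 1)     # sigma^2 <- b*/(a* - 1)
  }
  C.hat <- A %*% t(B)
  k.M <- r * (p + q - r) + 1
  loglik <- -0.5 * (n * q * log(2 * pi * sig2) + sum((Y - X %*% C.hat)^2) / sig2)
  -0.5 * (-2 * loglik + k.M * log(n * q))                 # approx. log p(Y | M_r)
}

The rank with the largest value of icm.bic(Y, X, r) over $r=1,\dots,\min(p,q)$ is then selected.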

DIC

Let $\boldsymbol{\theta}=(\mathbf{A},\mathbf{B},\sigma^2)$ and define the deviance
$$D(\boldsymbol{\theta})=-2\log p(\mathbf{Y}|\boldsymbol{\theta},M_r)=nq\log(2\pi\sigma^2)+\frac{1}{\sigma^2}\mathrm{trace}\{(\mathbf{Y}-\mathbf{X}\mathbf{C})^{\top}(\mathbf{Y}-\mathbf{X}\mathbf{C})\}.$$

Then DIC can be computed as
$$\mathrm{DIC}\approx 2T^{-1}\sum_{t=1}^{T}D(\boldsymbol{\theta}^{(t)})-D\!\left(T^{-1}\sum_{t=1}^{T}\boldsymbol{\theta}^{(t)}\right),$$
where $\boldsymbol{\theta}^{(t)}$ is an MCMC sample generated from $p(\boldsymbol{\theta}|\mathbf{Y})$.

Note that $\mathrm{DIC}=D(\bar{\boldsymbol{\theta}})+2P_D$, which is analogous to AIC.
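
For reference, here is a compact version of this computation as an R function; it mirrors the DIC block inside the full script below, and the name dic.rrr and the $p\times q\times T$ array layout of C.draws are my own conventions.

# DIC = 2 * mean_t D(theta^(t)) - D(theta.bar), with
# D(theta) = n*q*log(2*pi*sig2) + trace{(Y - X C)'(Y - X C)} / sig2
dic.rrr <- function(Y, X, C.draws, sig2.draws) {
  n <- nrow(Y); q <- ncol(Y); T.mc <- length(sig2.draws)
  dev <- function(C, sig2) n * q * log(2 * pi * sig2) + sum((Y - X %*% C)^2) / sig2
  D.t <- sapply(1:T.mc, function(t) dev(C.draws[, , t], sig2.draws[t]))
  D.at.mean <- dev(apply(C.draws, c(1, 2), mean), mean(sig2.draws))
  2 * mean(D.t) - D.at.mean
}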

Gelfand & Dey (GD)

Let $\boldsymbol{\theta}=(\mathbf{A},\mathbf{B},\sigma^2)$; then the GD estimator is
$$p(\mathbf{Y}|M)\approx\left[T^{-1}\sum_{t=1}^{T}\frac{g(\boldsymbol{\theta}^{(t)})}{p(\mathbf{Y}|\boldsymbol{\theta}^{(t)})\,p(\boldsymbol{\theta}^{(t)})}\right]^{-1},$$
where $\boldsymbol{\theta}^{(t)}$ is an MCMC sample generated from $p(\boldsymbol{\theta}|\mathbf{Y})$. Define $g(\boldsymbol{\theta})=\mathcal{N}(\boldsymbol{\theta}|\tilde{\boldsymbol{\theta}},\tilde{\boldsymbol{\Sigma}})$, where $\tilde{\boldsymbol{\theta}}$ and $\tilde{\boldsymbol{\Sigma}}$ are the MCMC sample mean and covariance, respectively. I use the following formula:
$$\begin{aligned}
\log p(\mathbf{Y}|M)&\approx\log\left[\frac{1}{T}\sum_{t=1}^{T}\frac{g^{(t)}}{f^{(t)}}\right]^{-1}=\log T-\log\left(\sum_{t=1}^{T}\frac{g^{(t)}}{f^{(t)}}\right)\\
&=\log T-\log\left(\sum_{t=1}^{T}\exp\left(\log g^{(t)}-\log f^{(t)}\right)\right)=\log T-\log\left(\sum_{t=1}^{T}\exp(c_t)\right)\qquad (3)\\
&=\log T-\log\left(e^{c_1}\sum_{t=1}^{T}\exp(c_t-c_1)\right)\\
&=\log T-\left[c_1+\log\left(1+\sum_{t=2}^{T}\exp(c_t-c_1)\right)\right],
\end{aligned}$$
where $g^{(t)}=g(\boldsymbol{\theta}^{(t)})$, $f^{(t)}=p(\mathbf{Y}|\boldsymbol{\theta}^{(t)})\,p(\boldsymbol{\theta}^{(t)})$, and $c_t=\log g^{(t)}-\log f^{(t)}$.

Note that when calculating $\log p(\mathbf{Y}|M)$ there is a numerical problem in formula (3): $\exp(\log g^{(t)}-\log f^{(t)})$ overflows, because $\log g^{(t)}-\log f^{(t)}$ can have a very large magnitude. To avoid this problem, I use the last expression, $\log T-\left[c_1+\log\left(1+\sum_{t=2}^{T}\exp(c_t-c_1)\right)\right]$, to calculate $\log p(\mathbf{Y}|M)$.
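
A minimal sketch of the GD estimator with this stabilization. Here I pivot on the maximum of the $c_t$ rather than $c_1$, which is the standard log-sum-exp form; the function name log.marg.gd and its inputs (a $T\times d$ matrix of vectorized draws plus vectors of log-likelihood and log-prior values at each draw) are my own conventions.

# Gelfand-Dey estimate of log p(Y | M): log T - logsumexp(c_t), where
# c_t = log g(theta^(t)) - log p(Y | theta^(t)) - log p(theta^(t)) and g() is a
# multivariate normal with the MCMC sample mean and covariance.
library(mvtnorm)

log.marg.gd <- function(theta.draws, log.lik, log.prior) {
  # theta.draws: T x d matrix of vectorized posterior draws
  # log.lik, log.prior: length-T vectors of log p(Y | theta^(t)) and log p(theta^(t))
  theta.bar <- colMeans(theta.draws)
  Sigma.bar <- cov(theta.draws)
  log.g <- dmvnorm(theta.draws, mean = theta.bar, sigma = Sigma.bar, log = TRUE)
  c.t <- log.g - log.lik - log.prior
  c.max <- max(c.t)                       # pivot on the largest c_t (log-sum-exp)
  log(nrow(theta.draws)) - (c.max + log(sum(exp(c.t - c.max))))
}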

Simulation Study

Data Generation

In the simulation study, my goal is to identify the true model, i.e. the true rank of the coefficient matrix $\mathbf{C}$, using the Laplace, DIC, and GD methods. I set $n=100$, $q=12$, $p=7$, and $\sigma^2=2$. The coefficient matrix is as follows:

$$\mathbf{C}=\left[\begin{array}{rrrrrrrrrrrr}
1 & 0 & 0 & 2 & -1 & 0 & 0 & 0 & 0 & 0 & 1 & -1\\
0 & 1 & 0 & 0 & 0 & -3 & 2 & 0 & 0 & 0 & -1 & 3\\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 3 & -3 & 4 & 2 & 2\\
1 & 0 & 0 & 2 & -1 & 0 & 0 & 0 & 0 & 0 & 1 & -1\\
0 & 1 & 0 & 0 & 0 & -3 & 2 & 0 & 0 & 0 & -1 & 3\\
0 & 1 & 0 & 0 & 0 & -3 & 2 & 0 & 0 & 0 & -1 & 3\\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 3 & -3 & 4 & 2 & 2
\end{array}\right]$$
Hence, the true rank of $\mathbf{C}$ is 3.

I generate data from the model in Eq. (1). For the priors, $p(\mathbf{A},\mathbf{B})\propto\exp\left(-\frac{\tau^2}{2}\mathrm{trace}\{\mathbf{A}^{\top}\mathbf{A}+\mathbf{B}\mathbf{B}^{\top}\}\right)$ and $\sigma^2\sim\mathcal{IG}\left(\frac{a}{2},\frac{b}{2}\right)$, with $\tau^2=10^{-3}$ and $a=b=1$.

After running the MCMC simulation, the results based on one replication are shown in Table 1.

Table 1: Model selection among Laplace, DIC, and GD based on one replication (Laplace and GD are reported on the log scale).
Rank r      1        2        3        4        5        6        7
Laplace  -3007    -2535    -2223    -2257    -2285    -2310    -2329
DIC      84300    17779     7302     7317     7325     7363     7405
GD       -3062    -2631    -2347    -2397    -2436    -2470    -2495
MSE      1.032    0.186    0.0120   0.014    0.0170   0.0192   0.0217

The MSE in the table is defined as
$$\mathrm{MSE}=\frac{\mathrm{trace}\{(\hat{\mathbf{C}}-\mathbf{C})^{\top}(\hat{\mathbf{C}}-\mathbf{C})\}}{pq}.$$
The MSE indicates the goodness of fit of the reduced rank regression model. We observe that the MSE is minimized when the rank is 3, which shows that the parameter estimation is accurate. In this case, the Laplace, DIC, and GD methods all select the true rank. Given $\mathrm{rank}(\mathbf{C})=3$, the estimated $\mathbf{C}$ is as follows:

$$\hat{\mathbf{C}}=\left[\begin{array}{rrrrrrrrrrrr}
1.07 & 0.17 & 0.01 & 2.34 & -0.89 & -0.12 & -0.11 & -0.08 & -0.07 & 0.03 & 1.26 & -1.10\\
-0.02 & 1.00 & 0.04 & -0.14 & -0.17 & -2.83 & 2.10 & 0.10 & 0.03 & 0.03 & -1.04 & 3.17\\
0.13 & -0.03 & 0.99 & 0.18 & -0.02 & 0.08 & 0.00 & 3.06 & -2.85 & 4.03 & 2.10 & 2.06\\
0.94 & 0.14 & 0.03 & 2.05 & -0.78 & -0.08 & -0.11 & -0.03 & -0.10 & 0.07 & 1.13 & -0.95\\
0.04 & 0.99 & 0.04 & -0.00 & -0.21 & -2.79 & 2.05 & 0.10 & 0.01 & 0.04 & -0.95 & 3.05\\
0.12 & 1.05 & 0.00 & 0.18 & -0.29 & -2.93 & 2.14 & -0.02 & 0.12 & -0.12 & -0.98 & 3.03\\
0.08 & -0.01 & 0.98 & 0.08 & 0.01 & 0.03 & 0.04 & 3.03 & -2.81 & 3.98 & 2.01 & 2.14
\end{array}\right]$$

In addition, when the selected rank is larger than 3, the MSE does not get much worse. In other words, for larger models (larger rank of $\mathbf{C}$), the overfitted models still estimate $\mathbf{C}$ well, and the corresponding values of DIC, Laplace, and GD stay close to their values at the true rank. This is reasonable for overfitted models. But when the selected rank is smaller than the true rank, the MSE increases significantly, and the corresponding values of DIC, Laplace, and GD are far from their values at the true rank. This means such models miss important structure. This analysis is based on one replication.

Table 2 reports results based on 1000 replications, to examine the performance of the model selection methods. In Table 2, Laplace and GD always select the true rank, whereas DIC tends to select a larger model: the average selected rank for DIC is about 3.5, while that of Laplace and GD is 3. Moreover, the probability of selecting the true rank is 100% for Laplace and GD, but only 61.7% for DIC. This makes sense because Laplace is an approximation to the marginal likelihood and GD is a simulation-based estimate of it, so both work well for model selection. DIC is analogous to AIC, which tends to select larger models and is better suited for short-term prediction.

Table 2: Comparison among Laplace, DIC, and GD based on 1000 replications.
Method    Mean of selected rank    Probability of selecting the true rank
Laplace   3                        1
DIC       3.525                    0.617
GD        3                        1

R Code for Bayesian reduced rank regression with DIC used in the simulation study

library(mvtnorm)
library(invgamma)
rm(list=ls())
## Data & Model##
row1 <- c(1,0,0,2,-1,0,0,0,0,0,1,-1)
row2 <- c(0,1,0,0,0,-3,2,0,0,0,-1,3)
row3 <- c(0,0,1,0,0,0,0,3,-3,4,2,2)
C <- rbind(row1,row2,row3,row1,row2,row2,row3)
true.rank <- 3
p <- dim(C)[1]
q <- dim(C)[2]
true.sig2 <- 2
true.SIGe <- diag(true.sig2,q)
n=100
a=b=1
tau2=1e-3
MC.size=5000
burn_in=3000
reptn <- 125
error <- array(NA,dim = c(reptn,p))
DIC <- array(NA,dim = c(reptn,p))
TPR.DIC <- rep(NA,reptn)
TPR.error <- rep(NA,reptn)
hat.rank <- rep(NA,reptn)
v=0
for (i in 1:reptn) {
  t1=Sys.time()
  set.seed(2144+i+125*v)
  X <- matrix(rnorm(n*p,0,1),n,p)
  E <- rmvnorm(n, rep(0,q), true.SIGe, method="chol")
  Y=X%*%C+E
  ## Reduced Rank Regression model ##
  for (r in 1:p) {
    Hat.A <- array(NA,dim = c(p,r,MC.size))
    Hat.B <- array(NA,dim = c(q,r,MC.size))
    Hat.C <- array(NA,dim = c(p,q,MC.size))
    Hat.sig2 <- rep(NA,MC.size)
    # initial values
    hat.C <- coef(lm(Y~X-1))
    if (r==1){
      hat.B <- as.matrix(hat.C[1:r,])
    } else {
      hat.B <- t(hat.C[1:r,])
    }
    if (r==p) {
      hat.A <- diag(1,r)
    } else {
      hat.A <- rbind(diag(1,r),hat.C[(r+1):p,]%*%hat.B%*%solve(crossprod(hat.B)))
    }

    hat.sig2 <- mean(diag(tcrossprod(Y-X%*%hat.A%*%t(hat.B))))  # initial value of sigma^2 from the residual sum of squares
    ## MCMC ##
    for (jin in 1:MC.size) {
      # Sampling A
      if (r==p){
        Hat.A[,,jin] <- hat.A
      } else {
        for(j in (r+1):p) {
          Sig.A_j <- solve(crossprod(hat.B)*as.numeric(crossprod(X[,j]))/hat.sig2+diag(tau2,r))
          mu.A_j <- Sig.A_j%*%t(hat.B)%*%t(Y-X[,-j]%*%hat.A[-j,]%*%t(hat.B))%*%X[,j]/hat.sig2
          hat.a_j <- rmvnorm(n=1,mean = mu.A_j,sigma = Sig.A_j)
          hat.A[j,] <- hat.a_j
        }
        Hat.A[,,jin] <- hat.A
      }
      # Sampling B
      Sig.B_l <- solve(crossprod(X%*%hat.A)/hat.sig2+diag(tau2,r))
      for(l in 1:q) {
        mu.B_l <- Sig.B_l%*%t(X%*%hat.A)%*%Y[,l]/hat.sig2
        hat.b_l <- t(rmvnorm(n=1,mean = mu.B_l,sigma = Sig.B_l))
        hat.B[l,] <- t(hat.b_l)
      }
      Hat.B[,,jin] <- hat.B
      Hat.C[,,jin] <- hat.A%*%t(hat.B)
      #Sampling Sig2
      hat.sig2 <- rinvgamma(n = 1,shape = (q*n+a)/2,rate = 0.5*(b+sum(diag(tcrossprod(Y-X%*%hat.A%*%t(hat.B))))))
      Hat.sig2[jin] <- hat.sig2
      #print(jin)
    }
    hat.C <- apply(Hat.C[,,-(1:burn_in)],c(1,2),mean)
    hat.sig2 <- mean(Hat.sig2[-(1:burn_in)])
    # print(C)
    D.theta <- rep(NA,MC.size-burn_in)
    j=0
    for (jin in (burn_in+1):MC.size) {
      j=j+1
      D.theta[j] <- n*q*log(2*pi*Hat.sig2[jin])+sum(diag(crossprod(Y-X%*%Hat.C[,,jin])))/Hat.sig2[jin]  # D(theta) = nq*log(2*pi*sig2) + SSR/sig2
    }
    D.hat.theta <- n*q*log(2*pi*hat.sig2)+sum(diag(crossprod(Y-X%*%hat.C)))/hat.sig2  # deviance at the posterior mean
    DIC[i,r] <- 2*mean(D.theta)- D.hat.theta
    error[i,r] <- sum((hat.C-C)^2)/(p*q)
  }
  hat.rank[i] <- which.min(DIC[i,])
  TPR.DIC[i] <- hat.rank[i]==true.rank
  TPR.error[i] <- which.min(error[i,])==true.rank
  print(c(i,hat.rank[i]))
  print(Sys.time()-t1)
}

data.frame(hat.rank=mean(hat.rank),TPR.DIC=mean(TPR.DIC),TPR.ERR=mean(TPR.error))
save(list = ls(), file = paste0('DIC-',v,'.RData'))

Reference

Geweke, John F. 1996. “Bayesian Reduced Rank Regression in Econometrics.” Journal of Econometrics 75 (November). https://doi.org/10.1016/0304-4076(95)01773-9.