Conditional moments tests were proposed by Tauchen (1985) and Newey (1985). They are closely related to the Lagrange multiplier (or score) test, which is, along with the likelihood ratio and the Wald tests, one of the three “classical” tests. These three tests have in common that they test a set of hypotheses that enables one to go from an unconstrained (or larger) model to a constrained (or smaller) model.
For the special case where the hypotheses can be written as linear combinations of the coefficients, the unconstrained model can always be parametrized in such a way that the likelihood function of the larger model is ln L(β1, β2), with, under H0, β2 = 0, so that the larger model reduces to the smaller one. For the score test, the statistic is based on the gradient of the log-likelihood computed with the parameters obtained by estimating the constrained model, β̂1c, 0, which is:
$$ \frac{\partial \ln L}{\partial \beta}(\hat{\beta}_1^c, 0) = \left( \begin{array}{c} \frac{\partial \ln L}{\partial \beta_1}(\hat{\beta}_1^c, 0) \\ \frac{\partial \ln L}{\partial \beta_2}(\hat{\beta}_1^c, 0) \end{array} \right) = \left( \begin{array}{c} 0 \\ \frac{\partial \ln L}{\partial \beta_2}(\hat{\beta}_1^c, 0) \end{array} \right) $$
As an example, consider a linear or log-linear Gaussian model for which one wishes to test the normality hypothesis. For the smaller model, we have:
yn = α + βxn + ϵn with ϵn ∼ N(0, σϵ)
for the linear model and:
ln yn = α + βxn + ϵn with ϵn ∼ N(0, σϵ)
for the log-linear model. A potential larger model consists in taking a Box-Cox transformation of y, denoted y(λ):
yn(λ) = α + βxn + ϵn with ϵn ∼ N(0, σϵ)
with:
$$ y_n^{(\lambda)} = \left\{ \begin{array}{lcl} \frac{y_n^\lambda - 1}{\lambda} & \mbox{if} & \lambda \neq 0 \\ \ln y_n & \mbox{if} & \lambda = 0 \end{array} \right. $$
which obviously reduces to the smaller log-linear model for λ = 0. For λ = 1, y(λ) = y − 1 and the larger model reduces to the linear model, except that 1 is subtracted from the left-hand side of the equation, which means that the intercept of the linear model fitted by ordinary least squares should be reduced by 1.
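These two limiting cases can be checked numerically; box_cox below is a hypothetical helper written for this check only, not part of the code used later:

```r
# Box-Cox transform; reduces to log(y) as lambda -> 0 and to y - 1 for lambda = 1
box_cox <- function(y, lambda){
    if (lambda == 0) log(y) else (y ^ lambda - 1) / lambda
}
box_cox(2, 1)    # y - 1 = 1
box_cox(2, 0)    # log(2)
box_cox(2, 1e-8) # close to log(2): the transform is continuous in lambda
```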
The log-likelihood of the larger model is:
$$ \ln L (\beta, \sigma, \lambda) = -\frac{N}{2}\ln 2\pi - N \ln \sigma - (1 - \lambda) \sum_{n=1}^N \ln y_n - \frac{1}{2\sigma ^ 2}\sum_{n=1}^N\left(y_n^{(\lambda)} - \beta^\top x_n\right) ^ 2 $$
The gradient of the log-likelihood is:
$$ \frac{\partial \ln L}{\partial \theta}(\theta) = \left( \begin{array}{c} \frac{\partial \ln L}{\partial \beta}\\ \frac{\partial \ln L}{\partial \sigma}\\ \frac{\partial \ln L}{\partial \lambda}\\ \end{array} \right) = \left( \begin{array}{c} \frac{1}{\sigma ^ 2} \sum_{n=1}^N \left(y_n^{(\lambda)} - \beta^\top x_n\right) x_n\\ -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{n=1}^N \left(y_n^{(\lambda)} - \beta^\top x_n\right) ^ 2\\ \sum_{n=1} ^ N \ln y_n - \frac{1}{\sigma ^ 2}\sum_{n=1}^N \left(y_n ^ {(\lambda)} - \beta^\top x_n\right) \left(y_n^{(\lambda)} \ln y_n - \frac{y_n ^ {(\lambda)} - \ln y_n}{\lambda}\right) \end{array} \right) $$
For the constrained linear (λ̂i = 1) and log-linear (λ̂o = 0) models, we have respectively:
$$ \frac{\partial \ln L}{\partial \theta}(\hat{\theta}_i) = \left( \begin{array}{c} 0 \\ 0 \\ \sum_{n=1} ^ N \ln y_n - \frac{1}{\hat{\sigma}_c ^ 2}\sum_{n=1}^N \left(y_n - 1 - \hat{\beta}_c^\top x_n\right) \left(1 + y_n(\ln y_n - 1)\right) \end{array} \right) $$
$$ \frac{\partial \ln L}{\partial \theta}(\hat{\theta}_o) = \left( \begin{array}{c} 0 \\ 0 \\ \sum_{n=1} ^ N \ln y_n - \frac{1}{2\hat{\sigma}_c ^ 2}\sum_{n=1}^N \left(\ln y_n - \hat{\beta}_c^\top x_n\right) \left(\ln y_n\right)^2 \end{array} \right) $$
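The λ component of the gradient is the most error-prone; it can be checked against a finite-difference approximation for a single observation (lnl_n and the numerical values below are arbitrary and used only for this check):

```r
# log-likelihood contribution of one observation, as a function of lambda
lnl_n <- function(lambda, y, mu, sigma){
    bc <- if (lambda == 0) log(y) else (y ^ lambda - 1) / lambda
    - 1 / 2 * log(2 * pi) - log(sigma) - (1 - lambda) * log(y) -
        (bc - mu) ^ 2 / (2 * sigma ^ 2)
}
y <- 2 ; mu <- 0.5 ; sigma <- 1 ; lambda <- 0.7
bc <- (y ^ lambda - 1) / lambda
bc_l <- bc * log(y) - (bc - log(y)) / lambda
# analytical derivative with respect to lambda, as in the gradient above
analytical <- log(y) - (bc - mu) * bc_l / sigma ^ 2
# central finite difference
numerical <- (lnl_n(lambda + 1e-6, y, mu, sigma) -
              lnl_n(lambda - 1e-6, y, mu, sigma)) / 2e-6
c(analytical, numerical)
```

The two values agree to several decimal places, which supports the expression of bc_l used in the code below.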
The following function returns the log-likelihood for a Box-Cox model, given a vector of parameters coef, a matrix of covariates X and a vector of response y. It has further arguments:
- sum: if FALSE, individual contributions to the log-likelihood (a vector of length N) and to the gradient (a matrix of dimension N × K) are returned,
- gradient: if TRUE, the analytical gradient is computed and returned as an attribute,
- hessian: if TRUE, the analytical Hessian is computed and returned as an attribute.
lnL_bc <- function(coef, X, y, sum = FALSE, gradient = FALSE, hessian = FALSE){
    N <- length(y)
    K <- length(coef) - 3
    beta <- coef[1:(K + 1)]
    sigma <- coef[K + 2]
    lambda <- coef[K + 3]
    bX <- as.numeric(X %*% beta)
    if (lambda == 0){
        bc <- log(y)
        bc_l <- 1 / 2 * log(y) ^ 2
        bc_ll <- 1 / 3 * log(y) ^ 3
    } else {
        bc <- (y ^ lambda - 1) / lambda
        bc_l <- bc * log(y) - (bc - log(y)) / lambda
        bc_ll <- bc_l * log(y) - (bc_l * lambda - (bc - log(y))) / lambda ^ 2
    }
    e <- bc - bX
    lnL <- - 1 / 2 * log(2 * pi) - log(sigma) - (1 - lambda) * log(y) -
        1 / (2 * sigma ^ 2) * e ^ 2
    if (gradient){
        g_beta <- 1 / sigma ^ 2 * e * X
        g_sigma <- - 1 / sigma + 1 / (sigma ^ 3) * e ^ 2
        g_lambda <- log(y) - 1 / (sigma ^ 2) * e * bc_l
        G <- cbind(g_beta, g_sigma, g_lambda)
    }
    if (hessian){
        h_bb <- - 1 / sigma ^ 2 * crossprod(X)
        h_bs <- apply(- 2 / sigma ^ 3 * X * e, 2, sum)
        h_bl <- apply(1 / sigma ^ 2 * bc_l * X, 2, sum)
        h_ss <- sum(1 / sigma ^ 2 - 3 / sigma ^ 4 * e ^ 2)
        h_sl <- sum(2 / sigma ^ 3 * e * bc_l)
        h_ll <- sum(- 1 / sigma ^ 2 * (e * bc_ll + bc_l ^ 2))
        H <- rbind(cbind(h_bb, sigma = h_bs, lambda = h_bl),
                   sigma = c(h_bs, h_ss, h_sl),
                   lambda = c(h_bl, h_sl, h_ll))
    }
    if (sum){
        lnL <- sum(lnL)
        if (gradient) G <- apply(G, 2, sum)
    }
    if (gradient) attr(lnL, "gradient") <- G
    if (hessian) attr(lnL, "hessian") <- H
    lnL
}
Consider as an example the cars data set provided by base R. The response is dist (the stopping distance) and the covariates are speed (the speed of the car) and its square. The smaller linear and log-linear models can be computed using lm. As stated previously, the linear model is a special case of the Box-Cox model once 1 is subtracted from the intercept.
To use the lnL_bc function, we need to extract the response vector and the covariates matrix:
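The corresponding code chunks do not appear above; here is a minimal reconstruction, assuming quadratic-in-speed specifications, which match the coefficient names (Intercept), speed and I(speed^2) displayed in the outputs below:

```r
# smaller models: linear and log-linear fits on the cars data set
lin_lm <- lm(dist ~ speed + I(speed ^ 2), data = cars)
log_lm <- lm(log(dist) ~ speed + I(speed ^ 2), data = cars)
# response vector and covariates matrix used by lnL_bc
y <- cars$dist
X <- model.matrix(lin_lm)
# coefficients of the linear model, with 1 subtracted from the intercept
coefs_lin <- coef(lin_lm) - c(1, 0, 0)
```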
Finally, we can evaluate the log-likelihood of the Box-Cox model for the two special cases of the linear and the log-linear models, using the coefficients obtained with lm (σ is a parameter of the log-likelihood; it is extracted from the lm models using the sigma2 function defined below, which differs from the sigma function in that the deviance is divided by the number of observations rather than by the number of degrees of freedom of the regression).
sigma2 <- function(x) sigma(x) * sqrt(df.residual(x) / nobs(x))
coefs_lin <- c(coef(lin_lm) - c(1, 0, 0), sigma = sigma2(lin_lm), lambda = 1)
coefs_log <- c(coef(log_lm), sigma = sigma2(log_lm), lambda = 0)
lnL_lin <- lnL_bc(coefs_lin, X, y, sum = TRUE, gradient = TRUE, hessian = FALSE)
lnL_log <- lnL_bc(coefs_log, X, y, sum = TRUE, gradient = TRUE, hessian = FALSE)
lnL_lin
## [1] -205.386
## attr(,"gradient")
## (Intercept) speed I(speed^2) g_sigma g_lambda
## 1.041376e-15 1.951564e-14 4.263673e-13 6.175616e-16 -2.184459e+01
lnL_log
## [1] -202.3903
## attr(,"gradient")
## (Intercept) speed I(speed^2) g_sigma g_lambda
## -2.936210e-13 -3.541334e-12 -4.704148e-11 2.620126e-13 2.956768e+01
According to the log-likelihood, the log-linear model fits the data appreciably better than the linear one. As expected, all the elements of the gradient vector except the last one are 0. The derivative of the log-likelihood with respect to λ is negative for the linear model and positive for the log-linear model. This means that the log-likelihood should increase for some values of λ between 0 and 1. To fit the Box-Cox model by maximum likelihood, we use the maxLik::maxLik function. It is good general practice to provide “good” (and different) starting values to ensure that the non-linear optimization process converges, and that it converges to a global maximum. We therefore run maxLik twice, with starting values corresponding to the linear and to the log-linear models:
library("maxLik")
bc_lin <- maxLik(lnL_bc, start = coefs_lin, X = X, y = y,
sum = TRUE, gradient = TRUE, hessian = TRUE)
bc_log <- maxLik(lnL_bc, start = coefs_log, X = X, y = y,
sum = TRUE, gradient = TRUE, hessian = TRUE)
We first check that the fitted model is the same for the two sets of starting values:
## [,1] [,2]
## (Intercept) 0.213367880 0.213370726
## speed 0.582112891 0.582111088
## I(speed^2) -0.005530642 -0.005530664
## sigma 1.365042140 1.365033776
## lambda 0.372872290 0.372870677
and we then summarize the fitted model:
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 15 iterations
## Return code 1: gradient close to zero (gradtol)
## Log-Likelihood: -197.3795
## 5 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## (Intercept) 0.213368 1.351837 0.158 0.87459
## speed 0.582113 0.232726 2.501 0.01237 *
## I(speed^2) -0.005531 0.006108 -0.905 0.36524
## sigma 1.365042 0.650610 2.098 0.03590 *
## lambda 0.372872 0.131795 2.829 0.00467 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
As expected, the fitted value of λ lies between 0 and 1 and is statistically different from both 0 and 1. Note also that the log-likelihood of the Box-Cox model (-197.38) is larger than the one of the log-linear model (-202.39).
If the hypotheses corresponding to the smaller model are true, the distribution of the score vector is:
g(θ̃) ∼ N(0, I(θ))
where I is the information matrix. Therefore, if H0 is true:
g(θ̃)⊤I(θ)−1g(θ̃)
is a χ2 with J degrees of freedom (J being the number of hypotheses, one in our example). $I(\theta)=-\mbox{E}\left(\frac{\partial^2 \ln L}{\partial \theta \partial \theta^\top}(\theta)\right)$ needs to be estimated, and two natural estimators are minus the Hessian evaluated at the constrained estimates, $-\frac{\partial^2 \ln L}{\partial \theta \partial \theta^\top}(\tilde{\theta})$, and the outer product of the individual gradient contributions, $G(\tilde{\theta})^\top G(\tilde{\theta})$.
The G matrix is obtained using the lnL_bc function with sum = FALSE and gradient = TRUE, and the H matrix using hessian = TRUE.
lnL_lin <- lnL_bc(coefs_lin, X, y, sum = FALSE, gradient = TRUE, hessian = TRUE)
lnL_log <- lnL_bc(coefs_log, X, y, sum = FALSE, gradient = TRUE, hessian = TRUE)
G_lin <- attr(lnL_lin, "gradient")
H_lin <- attr(lnL_lin, "hessian")
G_log <- attr(lnL_log, "gradient")
H_log <- attr(lnL_log, "hessian")
g_lin <- apply(G_lin, 2, sum)
g_log <- apply(G_log, 2, sum)
The test statistics can then be computed using any of the two estimators of the information matrix:
as.numeric(g_lin %*% solve(- H_lin) %*% g_lin)
## [1] 22.48473
as.numeric(g_lin %*% solve(crossprod(G_lin)) %*% g_lin)
## [1] 18.08425
as.numeric(g_log %*% solve(- H_log) %*% g_log)
## [1] 8.797853
as.numeric(g_log %*% solve(crossprod(G_log)) %*% g_log)
## [1] 9.428458
The two variants of the test statistic are reasonably close whatever the estimator of the information matrix used. The critical value of a χ2 with one degree of freedom at the 5% level being 3.84, the normality hypothesis is strongly rejected for both the linear and the log-linear models.
Note that g(θ̃) = G(θ̃)⊤ι where ι is a vector of ones of length N. Therefore the outer product variant of the test is:
ι⊤G(θ̃)[G(θ̃)⊤G(θ̃)]−1G(θ̃)⊤ι = ι⊤MG(θ̃)ι
where MG(θ̃) = G(θ̃)[G(θ̃)⊤G(θ̃)]−1G(θ̃)⊤ is the projection matrix on the columns of G(θ̃). MG(θ̃)ι therefore contains the fitted values of the regression of a column of ones on the columns of G(θ̃) and, as MG(θ̃) is idempotent, the test statistic is the sum of these fitted values. The (uncentered, as there is no intercept in this regression) R2 is this sum of the fitted values divided by the total variation of the response, which is equal to N (the response being a vector of ones of length N). Therefore, the test statistic can be obtained as N times the R2 of the regression of a column of ones on the columns of G.
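The equivalence between the quadratic form and N times the uncentered R2 can be illustrated on an arbitrary matrix standing in for G(θ̃); the data below are simulated purely for the sake of the example:

```r
set.seed(1)
# a stand-in for the N x K matrix of individual gradient contributions
G <- matrix(rnorm(200), nrow = 100, ncol = 2)
N <- nrow(G)
g <- colSums(G)
# quadratic-form version of the statistic
stat_quad <- as.numeric(g %*% solve(crossprod(G)) %*% g)
# N times the uncentered R2 of the regression of ones on the columns of G
stat_reg <- N * summary(lm(rep(1, N) ~ G - 1))$r.squared
c(stat_quad, stat_reg)
```

The two numbers are identical up to numerical precision.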
The conditional moment test approach consists in constructing a vector of moments which are 0 under H0. Denoting μn such a vector, we have, for the normality hypothesis:
$$ \mu_n(\theta) = \left( \begin{array}{c} \mu_{3n}\\ \mu_{4n} \end{array} \right)= \left(\begin{array}{c} \mbox{E}(\epsilon_n ^ 3) \\ \mbox{E}(\epsilon_n ^ 4 - 3 \sigma ^ 4) \end{array}\right) = 0 $$
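A quick simulation illustrates that both empirical moments are close to zero when the errors are indeed Gaussian (the sample size, seed and value of σ below are arbitrary):

```r
set.seed(2)
sigma <- 2
eps <- rnorm(1e5, mean = 0, sd = sigma)
# third moment and excess fourth moment, both zero in expectation under normality
m3 <- mean(eps ^ 3)
m4 <- mean(eps ^ 4 - 3 * sigma ^ 4)
c(m3, m4)
```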
Denoting ϵ̂n = yn − β̃⊤xn or ϵ̂n = ln yn − β̃⊤xn the residuals of the linear or the log-linear model, the empirical moments are:
$$ m_n(\tilde{\theta}) = \left( \begin{array}{c} \hat{\epsilon}_n ^ 3 \\ \hat{\epsilon}_n ^ 4 - 3 \sigma ^ 4 \end{array} \right) $$
Define M(θ̃)⊤ = (m1, m2, …, mN), an N × r matrix whose rows are the moment vectors for each observation, and $W(\theta)= \frac{1}{N}\sum_n \mbox{E}\left(\frac{\partial m_n}{\partial \theta}(\theta)\right)$, a K × r matrix containing the derivatives of the empirical moments with respect to the parameters of the model. Defining:
Q = [M(θ̃) − G(θ̃)I(θ)−1W(θ)]⊤[M(θ̃) − G(θ̃)I(θ)−1W(θ)]
the test statistic is:
m(θ̃)⊤Q−1m(θ̃)
and, under H0, follows a χ2 with r (the length of m) degrees of freedom.
Q plays the same role here as the information matrix does in the score test. As previously, different flavours of the test can be obtained depending on how Q is estimated: I(θ) and W(θ) can be computed either from their analytical expressions or from their outer product counterparts, G(θ̃)⊤G(θ̃) and G(θ̃)⊤M(θ̃). In the latter case, the test statistic can also be obtained as N times the R2 of the regression of a column of ones on the columns of G and M. The following function performs the test.
cm_norm <- function(x, type = c("analytical", "opg", "reg")){
    type <- match.arg(type)
    X <- model.matrix(x)
    y <- model.response(model.frame(x))
    N <- length(y)
    K <- length(coef(x))
    coefs <- c(coef(x), sigma = sigma2(x))
    beta <- coefs[1:K]
    sigma <- coefs[K + 1]
    epsilon <- as.numeric(y - X %*% beta)
    G <- cbind(1 / sigma ^ 2 * X * epsilon,
               sigma = - 1 / sigma + 1 / sigma ^ 3 * epsilon ^ 2)
    M <- cbind(asym = epsilon ^ 3, kurt = epsilon ^ 4 - 3 * sigma ^ 4)
    m <- apply(M, 2, sum)
    if (type == "analytical"){
        I <- - rbind(cbind(- 1 / sigma ^ 2 * crossprod(X),
                           sigma = - 2 / sigma ^ 3 * apply(X * epsilon, 2, sum)),
                     sigma = c(- 2 / sigma ^ 3 * apply(X * epsilon, 2, sum),
                               1 / sigma ^ 2 - 3 / sigma ^ 4 * sum(epsilon ^ 2)))
        W <- - rbind(cbind(asym = - 3 * apply(epsilon ^ 2 * X, 2, sum),
                           kurt = - 4 * apply(epsilon ^ 3 * X, 2, sum)),
                     sigma = c(0, - 12 * N / sigma ^ 3))
    }
    if (type == "opg"){
        I <- crossprod(G)
        W <- crossprod(G, M)
    }
    if (type != "reg"){
        Q <- crossprod(M - G %*% solve(I) %*% W)
        stat <- as.numeric(m %*% solve(Q) %*% m)
    }
    else stat <- summary(lm(rep(1, N) ~ G + M - 1))$r.squared * N
    stat
}
For the linear model, we get:
cm_norm(lin_lm, "analytical")
## [1] 12.2857
cm_norm(lin_lm, "opg")
## [1] 13.41273
cm_norm(lin_lm, "reg")
## [1] 13.41273
and the normality hypothesis is therefore strongly rejected. For the log-linear model:
cm_norm(log_lm, "analytical")
## [1] 0.07940664
cm_norm(log_lm, "opg")
## [1] 0.6255081
cm_norm(log_lm, "reg")
## [1] 0.6255081
On the contrary, the normality hypothesis is not rejected.