--- title: cmtest, an R package for conditional moments tests author: Yves Croissant date: 2021/03/10 bibliography: ../inst/REFERENCES.bib output: bookdown::html_document2: toc: true toc_float: true bookdown::pdf_document2: number_sections: true toc: false vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{cmtest: an R package for conditional moments tests} %\VignetteEngine{knitr::rmarkdown} --- Conditional moments tests where proposed by @TAUC:85 and @NEWE:85. As these tests are quite alike Lagrange multiplier (or score) tests, which is, along with the likelihood ratio and the Wald tests one of the three "classical" tests. These three tests have in common to test a set of hypothesis which enables to go from a unconstrained (or larger) model to a constrained (or smaller) model: - the Wald test is based on the values of the vector implied the restrictions evaluated for the constrained model, - the likelihood ratio is based on the comparison of the fits of both models (measured by the value of the log-likelihood function), - the score test is based on the value of the score (ie the vector of the first derivatives) of the log-likelihood of the unconstrained models using the estimates of the constrained model. # The score (or Lagrange multiplier) test For the special case where the hypothesis can be writen as a linear combination of the coefficients, the unconstrained model can always be parametrized in a way such that the likelihood function of the larger model is: $\ln L (\beta_1, \beta_2)$, with under H~0~: $\beta_2 = 0$ and the larger model reduces to the smaller one. For the score test, the statistic is based on the gradient of the log-likelihood computed with the parameters obtained by estimating the constrained model: $\hat{\beta}_1 ^c, 0$, which is: $$ \frac{\partial \ln L}{\partial \beta}(\hat{\beta}_1^c, 0) = \left( \begin{array}{c} \frac{\partial \ln L}{\partial \beta_1}(\hat{\beta}_1^c, 0) \\ \frac{\partial \ln L}{\partial \beta_2}(\hat{\beta}_1^c, 0) \end{array} \right) = \left( \begin{array}{c} 0 \\ \frac{\partial \ln L}{\partial \beta_2}(\hat{\beta}_1^c, 0) \end{array} \right) $$ As an example, consider a linear or log-linear gaussian model for which one wishes to test the normality hypothesis. For the smaller model, we have: $$ y_n = \alpha + \beta x_n + \epsilon_n \mbox{ with } \epsilon_n \sim N(0, \sigma_\epsilon) $$ for the linear model and: $$ \ln y_n = \alpha + \beta x_n + \epsilon_n \mbox{ with } \epsilon_n \sim N(0, \sigma_\epsilon) $$ for the log-linear model. A potential larger model consists on taking a Box-Cox transformation of $y$, denoted $y^{(\lambda)}$: $$ y^{(\lambda)} = \alpha + \beta x_n + \epsilon_n \mbox{ with } \epsilon_n \sim N(0, \sigma_\epsilon) $$ with: $$ y^{(\lambda)} = \left\{ \begin{array}{lcl} \frac{y_n^\lambda - 1}{\lambda} & \mbox{if} & \lambda \neq 0 \\ \ln y_n & \mbox{if} & \lambda = 0 \end{array} \right. $$ which obviously reduce to the smaller log-linear model for $\lambda = 0$. For $\lambda = 1$, $y ^ {(\lambda)} = y - 1$ and the larger model reduce to the linear model except that $1$ should be substracted from the left hand side of the equation, which means that the intercept of the linear model fitted by ordinary least squares should be reduced by 1. The log-likelihood of the larger model is: $$ \ln L (\beta, \sigma, \lambda) = -\frac{N}{2}\ln 2\pi - N \ln \sigma - (1 - \lambda) \sum_{n=1}^N \ln y_n - \frac{1}{2\sigma ^ 2}\sum_{n=1}^N\left(y^{(\lambda)} - \beta^\top x_n\right) ^ 2 $$ The gradient of the log-likelihood is: $$ \frac{\partial \ln L}{\partial \theta}(\theta) = \left( \begin{array}{c} \frac{\partial \ln L}{\partial \beta}\\ \frac{\partial \ln L}{\partial \sigma}\\ \frac{\partial \ln L}{\partial \lambda}\\ \end{array} \right) = \left( \begin{array}{c} \frac{1}{\sigma ^ 2} \sum_{n=1}^N \left(y^{(\lambda)} - \beta^\top x_n\right) x_n\\ -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{n=1}^N \left(y^{(\lambda)} - \beta^\top x_n\right) ^ 2\\ \sum_{n=1} ^ N \ln y_n - \frac{1}{\sigma ^ 2}\sum_{n=1}^N \left(y_n ^ {(\lambda)} - \beta^\top x_n\right) \left(y_n^{(\lambda)} \ln y_n - \frac{y_n ^ {(\lambda)} - \ln y_n}{\lambda}\right) \end{array} \right) $$ For the constrained linear ($\hat{\lambda}_{i} = 1$) and log-linear ($\hat{\lambda}_o = 0$) models, we have respectively: $$ \frac{\partial \ln L}{\partial \theta}(\hat{\theta}_i) = \left( \begin{array}{c} 0 \\ 0 \\ \sum_{n=1} ^ N \ln y_n - \frac{1}{2\hat{\sigma}_c ^ 2}\sum_{n=1}^N \left(\ln y_n - \hat{\beta}_c^\top x_n\right) \left(1 + y(\ln y - 1)\right) \end{array} \right) $$ $$ \frac{\partial \ln L}{\partial \theta}(\hat{\theta}_o) = \left( \begin{array}{c} 0 \\ 0 \\ \sum_{n=1} ^ N \ln y_n - \frac{1}{2\hat{\sigma}_c ^ 2}\sum_{n=1}^N \left(\ln y_n - \hat{\beta}_c^\top x_n\right) \ln y_n^2 \end{array} \right) $$ The following function returns the log-likelihood for a box-cox model, given a vector of parameters `coef`, a matrix of covariates `X` and a vector of response `y`. It has further arguments: - `sum`: if `FALSE`, invidual contributions to the log-likelihood (a vector of length $N$) and to the gradient (a matrix of dimensions $N\times K$ is returned, - `gradient`: if `TRUE` the analytical gradient is computed and returned as an attribute, - `hessian`: if `TRUE` the analytical hessian is computed and returned as an attribute. ```{r echo = TRUE} lnL_bc <- function(coef, X, y, sum = FALSE, gradient = FALSE, hessian = FALSE){ y <- y N <- length(y) K <- length(coef) - 3 beta <- coef[1:(K + 1)] sigma <- coef[K + 2] lambda <- coef[K + 3] bX <- as.numeric(X %*% beta) if (lambda == 0){ bc <- log(y) bc_l <- 1 / 2 * log(y) ^ 2 bc_ll <- 1 / 3 * log(y) ^ 3 } else{ bc <- (y ^ lambda - 1) / lambda bc_l <- bc * log(y) - (bc - log(y)) / lambda bc_ll <- bc_l * log(y) - (bc_l * lambda - (bc - log(y))) / lambda ^ 2 } e <- bc - bX lnL <- - 1 / 2 * log(2 * pi) - log(sigma) - (1 - lambda) * log(y) - 1 / (2 * sigma ^ 2) * e ^ 2 if (gradient){ g_beta <- 1 / sigma ^ 2 * e * X g_sigma <- - 1 / sigma + 1 / (sigma ^ 3) * e ^ 2 g_lambda <- log(y) - 1 / (sigma ^ 2) * e * bc_l G <- cbind(g_beta, g_sigma, g_lambda) } if (hessian){ h_bb <- - 1 / sigma ^ 2 * crossprod(X) h_bs <- apply(- 2 / sigma ^ 3 * X * e, 2, sum) h_bl <- apply( 1 / sigma ^ 2 * bc_l * X, 2, sum) h_ss <- sum(1 / sigma ^ 2 - 3 / sigma ^ 4 * e ^ 2) h_sl <- sum(2 / sigma ^ 3 * e * bc_l) h_ll <- sum(- 1 / sigma ^ 2 * (e * bc_ll + bc_l ^ 2)) H <- rbind(cbind(h_bb, sigma = h_bs, lambda = h_bl), sigma = c(h_bs, h_ss, h_sl), lambda = c(h_bl, h_sl, h_ll)) } if (sum){ lnL <- sum(lnL) if (gradient) G <- apply(G, 2, sum) } if (gradient) attr(lnL, "gradient") <- G if (hessian) attr(lnL, "hessian") <- H lnL } ``` Consider as an example the `cars` data set which is provided by base **R**. The response is `dist` (the stoping distance) and the only covariate is `speed` (the speed of the car). The smaller linear and log-linear models can be computed using `lm`: ```{r } lin_lm <- lm(dist ~ speed + I(speed ^ 2), cars) log_lm <- update(lin_lm, log(.) ~ .) ``` as stated previously, the linear model is a special case of the box-cox model if one is substracted from the intercept. ```{r } coefs_lin <- coef(lin_lm) coefs_lin[1] <- coefs_lin[1] - 1 ``` To use the `lnL_bc` function, we need to extract the response vector and the covariates matrix: ```{r } X <- model.matrix(lin_lm) y <- model.response(model.frame(lin_lm)) ``` Finally we can evaluate the log-likelihood of the box-cox model for the two special cases of the linear and the log-linear models using the coefficients obtained using `lm` ($\sigma$ is a parameter in the log-likelihood, it is extracted from the `lm` models using the provided `sigma2` function which differs from the `sigma` function by the fact that the deviance is divided by the number of observations and not the number of degrees of freedom of the regression). ```{r collapse = TRUE} sigma2 <- function(x) sigma(x) * sqrt(df.residual(x) / nobs(x)) coefs_lin <- c(coefs_lin, sigma = sigma2(lin_lm), lambda = 1) coefs_log <- c(coef(log_lm), sigma = sigma2(log_lm), lambda = 0) lnL_lin <- lnL_bc(coefs_lin, X, y, sum = TRUE, gradient = TRUE, hessian = FALSE) lnL_log <- lnL_bc(coefs_log, X, y, sum = TRUE, gradient = TRUE, hessian = FALSE) lnL_lin lnL_log ``` According to the log-likelihood, the log-linear model fits fairly better the data than the linear one. As expected, all the elements of the gradient vector except the last one is 0. The derivative of the log-likelihood with respect with $\lambda$ is negative for the linear model and positive for the log-linear model. This means that the log-likelihood should increase for some values of $\lambda$ between 0 and 1. To fit the box-cox log-likelihood model, we use the `maxLik::maxLik` function. It is a good general practise to provide "good" (and different) starting values to unsures that the non-linear optimization process converges, and that it converges to a global maximum. We therefore run twice `maxLik`, with starting values corresponding to the linear and to the non-linear model: ```{r warning = FALSE, message = FALSE} library("maxLik") bc_lin <- maxLik(lnL_bc, start = coefs_lin, X = X, y = y, sum = TRUE, gradient = TRUE, hessian = TRUE) bc_log <- maxLik(lnL_bc, start = coefs_log, X = X, y = y, sum = TRUE, gradient = TRUE, hessian = TRUE) ``` We first check that the fitted model is the same for the two sets of starting values: ```{r } cbind(coef(bc_lin), coef(bc_log)) ``` and we then summarize the fitted model: ```{r } summary(bc_lin) ``` As expected, the fitted value of $\lambda$ lies between 0 and 1 and is statistically different from 0 and from 1. Note also that the log-likelihood of the box-cox model (`r round(as.numeric(logLik(bc_lin)), 2)`) is larger than the one of the log-linear model (`r round(as.numeric(lnL_log), 2)`). If the hypothesis corresponding to the smaller model are true, the distribution of the score vector is: $$ g(\tilde{\theta}) \sim N(0, I(\theta)) $$ Were $I$ is the information matrix. Therefore if H~0~ is true: $$ g(\tilde{\theta}) ^ \top I(\theta) ^ {-1} g(\tilde{\theta}) $$ is a $\chi^2$ with $J$ degrees of freedom ($J$ being the number of hypothesis, one in our example). $I(\theta)=-\mbox{E}\left(\frac{\partial^2 \ln L}{\partial \theta \partial \theta^\top}(\theta)\right)$ needs to be estimate and two natural estimators are: - the empirical variance of the gradient: $\sum_{n=1} ^ N g(\tilde{\theta}, x_n, y_n) g(\tilde{\theta}, x_n, y_n)^\top = G(\tilde{\theta}, X, y) ^ \top G(\tilde{\theta}, X, y)$ (also called the outer-product of the gradient), where $G$ is an $N\times K$ matrix for which each line is the contribution of an observation to the gradient, - the opposite of the hessian: $-H(\tilde{\theta}) = -\frac{\partial^2 \ln L}{\partial \theta \partial \theta^\top}(\tilde{\theta})$. The $G$ matrix is obtained using the `lnL_bc` function with `sum = FALSE` and `gradient = TRUE` and the $H$ matrix using `hessian = TRUE`. ```{r } lnL_lin <- lnL_bc(coefs_lin, X, y, sum = FALSE, gradient = TRUE, hessian = TRUE) lnL_log <- lnL_bc(coefs_log, X, y, sum = FALSE, gradient = TRUE, hessian = TRUE) G_lin <- attr(lnL_lin, "gradient") H_lin <- attr(lnL_lin, "hessian") G_log <- attr(lnL_log, "gradient") H_log <- attr(lnL_log, "hessian") g_lin <- apply(G_lin, 2, sum) g_log <- apply(G_log, 2, sum) ``` The test statistics can then be computed using any of the two estimators of the information matrix: ```{r collapse = TRUE} as.numeric(g_lin %*% solve(- H_lin) %*% g_lin) as.numeric(g_lin %*% solve(crossprod(G_lin)) %*% g_lin) as.numeric(g_log %*% solve(- H_log) %*% g_log) as.numeric(g_log %*% solve(crossprod(G_log)) %*% g_log) ``` The two variants of the same test statistics are very close whatever the approximation of the information matrix used. The critical value of a $\chi^2$ with one degree of freedom at the 5% level being `r round(qchisq(0.95, df = 1), 2)`, the two hypothesis of linear and log normality are strongly rejected. Note that $g(\tilde{\theta}) = G(\tilde{\theta})^\top \iota$ where $\iota$ is a vector of one of length $N$. Therefore the outer product variant of the test is: $$ \iota^\top G(\tilde{\theta}) \left[G(\tilde{\theta}) ^ \top G(\tilde{\theta})\right] ^ {-1} G(\tilde{\theta}) ^ \top \iota = \iota^\top M_G(\tilde{\theta}) \iota $$ where $M_G(\tilde{\theta}) = G(\tilde{\theta})\left[G(\tilde{\theta}) ^ \top G(\tilde{\theta})\right] ^ {-1} G(\tilde{\theta}) ^ \top$ is the projection matrix on the columns of $G(\tilde{\theta})$. $M_G(\tilde{\theta}) \iota$ is therefore the fitted values of the regression of a column of ones with the columns of $G(\tilde{\theta})$ and, as $M_G(\tilde{\theta})$ is idempotent, the test statistic is the sum of the fitted values of this regression. The (uncetered as there is no intercept in this regression) $R^2$ is this sum of the fitted values divided by the total variation of the response, which is equal to $N$ (the response being a vector of ones of length $N$). Therefore, the test statistic can be obtained as $N$ times the $R^2$ of the regression of $1$ one on the columns of $G$: ```{r collapse = TRUE} iota <- rep(1, length(y)) lm_ones_lin <- lm(iota ~ G_lin - 1) lm_ones_log <- lm(iota ~ G_log - 1) summary(lm_ones_lin)$r.squared * nobs(lm_ones_lin) summary(lm_ones_log)$r.squared * nobs(lm_ones_log) ``` # The conditional moment test The conditional moment test approach consists on constructing a vector of moments which are 0 under H~0~. Denoting $\mu_n$ such a vector, we have, for the normality hypothesis: $$ \mu_n(\theta) = \left( \begin{array}{c} \mu_{3n}\\ \mu_{4n} \end{array} \right)= \left(\begin{array}{c} \mbox{E}(\epsilon_n ^ 3) \\ \mbox{E}(\epsilon_n ^ 4 - 3 \sigma ^ 4) \end{array}\right) = 0 $$ Denoting $\hat{\epsilon}_n = y_n - \tilde{\beta} ^ \top x_n$ or $\hat{\epsilon}_n = \ln y_n - \tilde{\beta} ^ \top x_n$ the residuals of the linear or the log-linear model, the empirical moments are: $$ m_n(\tilde{\theta}) = \left( \begin{array}{c} \hat{\epsilon}_n ^ 3 \\ \hat{\epsilon}_n ^ 4 - 3 \sigma ^ 4 \end{array} \right) $$ Define : $M (\tilde{\theta}) ^ \top = \left(m_1, m_2, \ldots, m_N\right)$ an $N\times r$ matrix for which each line is the moment vector for on observation and $W(\theta)= \frac{1}{N}\sum_n \mbox{E}\left(\frac{\partial m_n}{\partial \theta}(\theta)\right)$ a $K \times r$ matrix containing the derivatives of the empirical moments with the parameters of the model. Defining: $$ Q = \left[M(\tilde{\theta}) - G(\tilde{\theta}) I(\theta) ^ {-1} W(\theta)\right]^\top \left[M(\tilde{\theta}) - G(\tilde{\theta}) I(\theta) ^ {-1} W(\theta)\right] $$ the test statistic is: $$ m(\tilde{\theta}) ^ \top Q ^ {-1} m(\tilde{\theta}) $$ and under H~o~ follows a $\chi^2$ with $r$ (the length of $m$) degrees of freedom. $Q$ plays the role of the information matrix for the score test. As previously, different flavours of the test can be obtained depending on how $Q$ is estimated: - numerically, using as previously $\hat{I}_N = G(\tilde{\theta}) ^ \top G(\tilde{\theta})$ and $\hat{W}_N = G(\tilde{\theta}) ^ \top M(\tilde{\theta})$, - using the analytical derivatives, using as previously $\hat{I}_A = - H(\tilde{\theta})$ and $\hat{W}_A = - \frac{\partial M}{\partial \theta}(\tilde{\theta})$ In this latter case, the test statistic can also be obtained as $N$ times the $R^2$ of the regression of a column of one on the columns of $G$ and $M$. The following function performs the test. ```{r } cm_norm <- function(x, type = c("analytical", "opg", "reg")){ type <- match.arg(type) X <- model.matrix(x) y <- model.response(model.frame(x)) N <- length(y) K <- length(coef(x)) coefs <- c(coef(x), sigma = sigma2(x)) beta <- coefs[1:K] sigma <- coefs[K + 1] epsilon <- as.numeric(y - X %*% beta) G <- cbind(1 / sigma ^ 2 * X * epsilon, sigma = - 1 / sigma + 1 / sigma ^ 3 * epsilon ^ 2) M <- cbind(asym = epsilon ^ 3, kurt = epsilon ^ 4 - 3 * sigma ^ 4) m <- apply(M, 2, sum) if (type == "analytical"){ I <- - rbind(cbind(- 1 / sigma ^ 2 * crossprod(X), sigma = - 2 / sigma ^ 3 * apply(X * epsilon, 2, sum)), sigma = c(- 2 / sigma ^ 3 * apply(X * epsilon, 2, sum), 1 / sigma ^ 2 - 3 / sigma ^ 4 * sum(epsilon ^ 2))) W <- - rbind(cbind(asym = - 3 * apply(epsilon ^ 2 * X, 2, sum), kurt = - 4 * apply(epsilon ^ 3 * X, 2, sum)), sigma = c(0, - 12 * N / sigma ^ 3)) } if (type == "opg"){ I <- crossprod(G) W <- crossprod(G, M) } if (type != "reg"){ Q <- crossprod(M - G %*% solve(I) %*% W) stat <- as.numeric(m %*% solve(Q) %*% m) } else stat <- summary(lm(rep(1, N) ~ G + M - 1))$r.squared * N stat } ``` For the linear model, we get: ```{r collapse = TRUE} cm_norm(lin_lm, "analytical") cm_norm(lin_lm, "opg") cm_norm(lin_lm, "reg") ``` and the normality hypothesis is therefore strongly rejected. For the log-linear model: ```{r collapse = TRUE} cm_norm(log_lm, "analytical") cm_norm(log_lm, "opg") cm_norm(log_lm, "reg") ``` on the contrary, the normality hypothesis is not rejected. ```{r eval = FALSE, include = FALSE} JB <- function(x){ e <- resid(x) mu <- mean(e) sigma <- sqrt( mean( (e - mu) ^ 2)) m3 <- mean( (e - mu) ^ 3) m4 <- mean( (e - mu) ^ 4) Asym <- m3 / sigma ^ 3 Kurt <- m4 / sigma ^ 4 df.residual(x) / 6 * (Asym ^ 2 + (Kurt - 3) ^ 2 / 4) } ``` # References