Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression

March 5, 2013 By Paul Allison

The Hosmer-Lemeshow (HL) test for logistic regression is widely used to answer the question "How well does my model fit the data?" But I've found it to be unsatisfactory for several reasons that I'll explain in this post.

First, some background. Last month I wrote about several R2 measures for logistic regression, one approach to assessing model fit. R2 is a measure of predictive power, that is, how well you can predict the dependent variable based on the independent variables. That may be an important concern, but it doesn't really address the question of whether the model is consistent with the data.

By contrast, goodness-of-fit (GOF) tests help you decide whether your model is correctly specified. They produce a p-value—if it’s low (say, below .05), you reject the model. If it’s high, then your model passes the test.

In what ways might a model be misspecified? Well, the most important potential problems are interactions and nonlinearities. You can always produce a satisfactory fit by adding enough interactions and nonlinearities. But do you really need them? GOF tests are designed to answer that question. Another issue is whether the “link” function is correct. Is it logit, probit, complementary log-log, or something else entirely?

For both linear and logistic regression, it’s possible to have a low R2 and still have a model that is correctly specified in every respect. And vice versa, you can have a very high R2 and yet have a model that is grossly inconsistent with the data.

GOF tests are readily available for logistic regression when the data can be aggregated or grouped into unique "profiles". Profiles are groups of cases that have exactly the same values on the predictors. Suppose, for example, that the model has just two predictor variables, sex (1=male, 0=female) and marital status (1=married, 0=unmarried). There are then four profiles: married males, unmarried males, married females and unmarried females, presumably with many cases in each profile.

Suppose we then fit a logistic regression model with the two predictors, sex and marital status (but not their interaction). For each profile, we can get an observed number of events and an expected number of events based on the model. There are two well-known statistics for comparing the observed number with the expected number: the deviance and Pearson’s chi-square.

The deviance is a likelihood ratio test of the fitted model versus a “saturated” model that perfectly fits the data. In our hypothetical example, a saturated model would include the interaction of sex and marital status. In that case, the deviance is testing the “no interaction” model as the null hypothesis, with the interaction model as the alternative. A low p-value suggests that the simpler model (without the interaction) should be rejected in favor of the more complex one (with the interaction). Pearson’s chi-square is an alternative method for testing the same hypothesis. It’s just the application of Pearson’s familiar formula for comparing observed with expected numbers of events (and non-events).
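To make the two statistics concrete, here is a minimal Python sketch for the four-profile example. The counts and fitted probabilities below are invented purely for illustration; in a real analysis the fitted probabilities would come from the estimated no-interaction model.

```python
import math

# Hypothetical grouped data: four profiles (sex x marital status).
# n = cases in the profile, y = observed events,
# p = fitted event probability from the no-interaction model.
# All numbers are made up for illustration.
profiles = [
    (200, 60, 0.28),
    (180, 40, 0.24),
    (210, 90, 0.41),
    (160, 55, 0.36),
]

deviance = 0.0
pearson = 0.0
for n, y, p in profiles:
    e = n * p                      # expected number of events
    # Deviance: 2 * sum of observed * log(observed/expected),
    # over both the event and non-event cells.
    deviance += 2 * (y * math.log(y / e)
                     + (n - y) * math.log((n - y) / (n - e)))
    # Pearson: (O - E)^2 / E, summed over both cells.
    pearson += (y - e) ** 2 / e + ((n - y) - (n - e)) ** 2 / (n - e)

print(f"deviance = {deviance:.2f}, Pearson X2 = {pearson:.2f}")
# Each statistic is compared to a chi-square distribution with
# (number of profiles - number of parameters) df; here 4 - 3 = 1.
```

With well-populated profiles the two statistics are usually close, as they are here; both test the no-interaction model against the saturated one.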

Both of these statistics have good properties when the expected number of events in each profile is at least 5. But most contemporary applications of logistic regression use data that do not allow for aggregation into profiles because the model includes one or more continuous (or nearly continuous) predictors. When there is only one case per profile, both the deviance and Pearson chi-square have distributions that depart markedly from a true chi-square distribution, yielding p-values that may be wildly inaccurate.

What to do? Hosmer and Lemeshow (1980) proposed grouping cases together according to their predicted values from the logistic regression model. Specifically, the predicted values are arrayed from lowest to highest, and then separated into several groups of approximately equal size. Ten groups is the standard recommendation.

For each group, we calculate the observed number of events and non-events, as well as the expected number of events and non-events. The expected number of events is just the sum of the predicted probabilities over the individuals in the group, and the expected number of non-events is the group size minus the expected number of events.

Pearson’s chi-square is then applied to compare observed counts with expected counts. The degrees of freedom is the number of groups minus 2. As with the classic GOF tests, low p-values suggest rejection of the model.
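The procedure can be sketched in a few lines of Python. This is a simplified illustration, not the exact algorithm in Stata or any other package (real implementations handle ties in the predicted values more carefully), and the simulated data are invented for the demonstration.

```python
import random

def hosmer_lemeshow(y, p, g=10):
    """HL chi-square for 0/1 outcomes y and fitted probabilities p,
    using g groups of roughly equal size.
    Compare the result to a chi-square with g - 2 df."""
    pairs = sorted(zip(p, y))              # sort cases by predicted value
    n = len(pairs)
    chi2 = 0.0
    for k in range(g):
        grp = pairs[k * n // g:(k + 1) * n // g]   # approx equal-size groups
        m = len(grp)                       # group size
        obs = sum(yi for _, yi in grp)     # observed events
        exp = sum(pi for pi, _ in grp)     # expected events: sum of p's
        # Pearson contributions for the event and non-event cells
        chi2 += (obs - exp) ** 2 / exp
        chi2 += ((m - obs) - (m - exp)) ** 2 / (m - exp)
    return chi2

# Illustrative data simulated from a correctly specified model
random.seed(1)
p = [random.uniform(0.1, 0.9) for _ in range(1000)]
y = [1 if random.random() < pi else 0 for pi in p]
for g in (9, 10, 11):
    print(g, round(hosmer_lemeshow(y, p, g=g), 2))
```

Even here, where the probabilities fed to the test are exactly right, the statistic shifts as g changes, which foreshadows the arbitrariness discussed below.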

It seems like a clever solution, but it turns out to have serious problems. The most troubling problem is that results can depend markedly on the number of groups, and there’s no theory to guide the choice of that number. This problem did not become apparent until software packages started allowing you to specify the number of groups, rather than just using 10.

Here's an example using Stata with the famous Mroz data set that I used in last month's post. The sample consists of 753 women, and the dependent variable is whether or not a woman is in the labor force. Here is the Stata code for producing the HL statistic based on 10 groups:

use http://www.uam.es/personal_pdi/economicas/rsmanga/docs/mroz.dta, clear
logistic inlf kidslt6 age educ huswage city exper
estat gof, group(10)

The estat gof command produces a chi-square of 15.52 with 8 df, yielding a p-value of .0499—just barely significant. This suggests that the model is not a satisfactory fit to the data, and that interactions and non-linearities are needed (or maybe a different link function). But if we specify 9 groups using the option group(9), the p-value rises to .11. And with group(11), the p-value is .64. Clearly, it's not acceptable for the results to depend so greatly on such minor changes to a test characteristic that is completely arbitrary. Examples like this one are easy to come by.

But wait, there’s more. One would hope that adding a statistically significant interaction or non-linearity to a model would improve its fit, as judged by the HL test. But often that doesn’t happen. Suppose, for example, that we add the square of exper (labor force experience) to the model, allowing for non-linearity in the effect of experience. The squared term is highly significant (p=.002). But with 9 groups, the HL chi-square increases from 11.65 (p=.11) in the simpler model to 13.34 (p=.06) in the more complex model. That result suggests that we’d be better off with the model that excludes the squared term.

The reverse can also happen. Quite frequently, adding a non-significant interaction or non-linearity to a model will substantially improve the HL fit. For example, I added the interaction of educ and exper to the basic model above. The product term had a p-value of .68, clearly not statistically significant. But the HL chi-square (based on 10 groups) declined from 15.52 (p=.05) to 9.19 (p=.33). Again, unacceptable behavior.

If the HL test is no good, then how can we assess the fit of the model? It turns out that there’s been quite a bit of recent work on this topic.  In next month’s post, I’ll describe some of the newer approaches.

If you want to learn more about logistic regression, check out my book Logistic Regression Using SAS: Theory and Application, Second Edition (2012), or try my seminars on Logistic Regression Using SAS or Logistic Regression Using Stata.


Hosmer, D.W. and Lemeshow, S. (1980) "A goodness-of-fit test for the multiple logistic regression model." Communications in Statistics A10: 1043-1069.


18 Responses

  1. Roger Keller says:

    Very good explanation. I have seen this problem in my analyses too and could not find a "right" number of groups for the HL test…just because there isn't one. Thanks.

  2. Quin says:

    The H-L test fails most of the time in the very large datasets commonly seen in the financial industry. Any better tests to deal with this situation would be very helpful.

  3. Matt Bogard says:

    I've also seen several criticisms that the HL test is too sensitive to large sample sizes. I'm not sure of the validity of this criticism, but I look forward to next month's article—maybe the new approaches you are referring to will address this issue, if it is valid.

    For instance:

    Volume 12, Number 2, 2009
    “The Hosmer-Lemeshow test detected a statistically significant degree of miscalibration in both models, due to the extremely large sample size of the models, as the differences between the observed and expected values within each group are relatively small”


    SIZE MATTERS TO A MODEL's FIT (comment in Crit Care Med. 2007 Sep; 35(9): 2213)

    “Caution should be used in interpreting the calibration of predictive models developed using a smaller data set when applied to larger numbers of patients. A significant Hosmer-Lemeshow test does not necessarily mean that a predictive model is not useful or suspect. While decisions concerning a mortality model’s suitability should include the Hosmer-Lemeshow test, additional information needs to be taken into consideration. This includes the overall number of patients, the observed and predicted probabilities within each decile, and adjunct measures of model calibration.”

    and from STATA LIST comments:


    “It follows that with large sample sizes any discrepancy between the model and the data will be magnified, resulting in small p-values for a goodness of fit test.”

    • Paul Allison says:

      The large sample size issue is a potential problem with ANY goodness of fit test. With large sample sizes, even trivial departures from the model specification are likely to show up as statistically significant. Actually, simulation results suggest that the HL test has relatively LOW power for detecting certain kinds of model misspecification, especially interactions.

  4. Robin says:

    I look forward to the next post on this topic. I’m dealing with a CPS dataset with nearly 100,000 observations and find the H-L test to be significant, yet looking at the tables the counts in the expected/observed columns are very close, not different enough to warrant changes to a model that is theoretically very sound.

    What are your thoughts on the link test (Stata linktest command)?

  5. William Chiu says:

    I propose calculating the HL statistic on the “hold-out sample” rather than the “model development sample”. Assuming you have a lot of data, you can do a 75% development data set, and 25% hold-out data set.

    If you don’t have enough data points for a hold-out data set, I recommend the BIC which penalizes for model complexity. http://en.wikipedia.org/wiki/Bayesian_information_criterion

  6. emeryL says:

    Are you still planning a follow-up article on a good alternative to the HL test? I’d be very interested to read it.

    This article was really helpful!

  7. Angelica says:

    Very helpful. Thank you!

  8. Rogelio says:


    I have just read this post and found it really interesting, which is why I am looking forward to reading the post on a good alternative to this HL test (which, in fact, has driven me crazy these last three months). Where can I find the explanations of those good alternatives?

    Thank you very much.
    Rogelio Pujol
    Statistical Researcher

  9. Sarah says:

    Is le Cessie and Houwelingen test better?

  10. Neil Shephard says:

    You might be interested in this article from Hosmer & Lemeshow (and a couple of others), which critiques the Hosmer-Lemeshow goodness-of-fit test and looks at how it and others actually perform (my takeaway was that none of them are that great)…



  11. Jean says:

    A clear explanation and a very helpful description of the HL test of GOF.
    Thank you!

  12. Wei says:

    Hi Paul,

    Have you published a paper on this particular finding? If so, would you please provide me with a link so I can refer to it in my work?


