What’s the Best R-Squared for Logistic Regression?

February 13, 2013 By Paul Allison

One of the most frequent questions I get about logistic regression is “How can I tell if my model fits the data?” There are two general approaches to answering this question. One is to get a measure of how well you can predict the dependent variable based on the independent variables. The other is to test whether the model needs to be more complex, specifically, whether it needs additional nonlinearities and interactions to satisfactorily represent the data.

In a later post, I’ll discuss the second approach to model fit, and I’ll explain why I don’t like the Hosmer-Lemeshow goodness-of-fit test. In this post, I’m going to focus on R2 measures of predictive power. Along the way, I’m going to retract one of my long-standing recommendations regarding these measures.

Unfortunately, there are many different ways to calculate an R2 for logistic regression, and no consensus on which one is best. Mittlbock and Schemper (1996) reviewed 12 different measures; Menard (2000) considered several others. The two methods that are most often reported in statistical software appear to be one proposed by McFadden (1974) and another that is usually attributed to Cox and Snell (1989), along with its “corrected” version (see below). However, the Cox-Snell R2 (both corrected and uncorrected) was actually discussed earlier by Maddala (1983) and by Cragg and Uhler (1970).

Among the statistical packages that I’m familiar with, SAS and Statistica report the Cox-Snell measures.  JMP and SYSTAT report both McFadden and Cox-Snell. SPSS reports the Cox-Snell measures for binary logistic regression but McFadden’s measure for multinomial and ordered logit.  

For years, I’ve been recommending the Cox and Snell R2 over the McFadden R2, but I’ve recently concluded that that was a mistake. I now believe that McFadden’s R2 is a better choice. However, I’ve also learned about another R2 that has good properties, a lot of intuitive appeal, and is easily calculated. At the moment, I like it better than the McFadden R2. But I’m not going to make a definite recommendation until I get more experience with it.

Here are the details. Logistic regression is, of course, estimated by maximizing the likelihood function. Let L0 be the value of the likelihood function for a model with no predictors, and let LM be the likelihood for the model being estimated. McFadden’s R2 is defined as

     R2McF = 1 – ln(LM) / ln(L0)

where ln(.) is the natural logarithm. The rationale for this formula is that ln(L0) plays a role analogous to the total sum of squares in linear regression (the residual sum of squares for a model with no predictors), while ln(LM) plays the role of the residual sum of squares for the fitted model. Consequently, this formula corresponds to a proportional reduction in “error variance”. It’s sometimes referred to as a “pseudo” R2.
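For concreteness, here’s a minimal Stata sketch of that calculation, using the Mroz model described later in this post. It relies on the log-likelihoods that Stata stores after estimation, e(ll) and e(ll_0):

logistic inlf kidslt6 age educ huswage city exper
* McFadden's R2 from the stored log-likelihoods of the fitted and null models
display "McFadden R2 = " 1 - e(ll)/e(ll_0)
* Stata reports this same quantity as the pseudo R2 and stores it in e(r2_p)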

The Cox and Snell R2 is

     R2C&S = 1 – (L0 / LM)^(2/n)

where n is the sample size. The rationale for this formula is that, for normal-theory linear regression, it’s an identity. In other words, the usual R2 for linear regression depends on the likelihoods for the models with and without predictors by precisely this formula. It’s appropriate, then, to describe this as a “generalized” R2 rather than a pseudo R2.  By contrast, the McFadden R2 does not have the OLS R2 as a special case. I’ve always found this property of the Cox-Snell R2 to be very attractive, especially because the formula can be naturally extended to other kinds of regression estimated by maximum likelihood, like negative binomial regression for count data or Weibull regression for survival data.
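As a sketch, the same stored results give the Cox-Snell R2 in one line, run immediately after the logistic command above (e(N) is the estimation sample size):

* Cox-Snell R2 = 1 - (L0/LM)^(2/n), computed on the log scale
display "Cox-Snell R2 = " 1 - exp((2/e(N)) * (e(ll_0) - e(ll)))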

It’s well known, however, that the big problem with the Cox-Snell R2 is that it has an upper bound that is less than 1.0. Specifically, the upper bound is 1 – L0^(2/n). This can be a lot less than 1.0, and it depends only on p, the marginal proportion of cases with events:

            upper bound = 1 – [p^p (1-p)^(1-p)]^2

This has a maximum of .75 when p=.5. By contrast, when p=.9 (or .1), the upper bound is only .48.

For those who want an R2 that behaves like a linear-model R2, this is deeply unsettling. There is a simple correction, and that is to divide R2C&S by its upper bound, which produces the R2 attributed to Nagelkerke (1991). But this correction is purely ad hoc, and it greatly reduces the theoretical appeal of the original R2C&S.  I also think that the values it typically produces are misleadingly high, especially compared with what you get from a linear probability model. (Some might view this as a feature, however).
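To make the arithmetic concrete, here is a short Stata sketch that evaluates the upper bound for a given marginal proportion and then rescales the Cox-Snell R2 to get the Nagelkerke version. The local macro names are mine, and the model is assumed to have just been fit:

* Upper bound of the Cox-Snell R2 when the marginal proportion of events is .9
local p = .9
display "upper bound = " 1 - (`p'^`p' * (1-`p')^(1-`p'))^2

* Nagelkerke R2: divide the Cox-Snell R2 by the bound implied by the sample proportion
quietly summarize inlf
local p = r(mean)
local bound = 1 - (`p'^`p' * (1-`p')^(1-`p'))^2
display "Nagelkerke R2 = " (1 - exp((2/e(N)) * (e(ll_0) - e(ll)))) / `bound'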

So, with some reluctance, I’ve decided to cross over to the McFadden camp. As Menard (2000) argued, it satisfies almost all of Kvalseth’s (1985) eight criteria for a good R2. When the marginal proportion is around .5, the McFadden R2 tends to be a little smaller than the uncorrected Cox-Snell R2. When the marginal proportion is nearer to 0 or 1, the McFadden R2 tends to be larger.

But there’s another R2, recently proposed by Tjur (2009), that I’m inclined to prefer over McFadden’s. It has a lot of intuitive appeal, its upper bound is 1.0, and it’s closely related to R2 definitions for linear models. It’s also easy to calculate.

The definition is very simple. For each of the two categories of the dependent variable, calculate the mean of the predicted probabilities of an event. Then, take the difference between those two means. That’s it!

The motivation should be clear. If a model makes good predictions, the cases with events should have high predicted values and the cases without events should have low predicted values. Tjur also showed that his R2 (which he called the coefficient of discrimination) is equal to the arithmetic mean of two R2 formulas based on squared residuals, and equal to the geometric mean of two other R2’s based on squared residuals. 

Here’s an example of how to calculate Tjur’s statistic in Stata. I used a well-known data set on labor force participation of 753 married women (Mroz 1987). The dependent variable inlf is coded 1 if a woman was in the labor force, otherwise 0.  A logistic regression model was fit with six predictors.

Here’s the code:

use "http://www.uam.es/personal_pdi/economicas/rsmanga/docs/mroz.dta", clear
logistic inlf kidslt6 age educ huswage city exper
predict yhat if e(sample)
ttest yhat, by(inlf)

The predict command produces fitted values and stores them in a new variable called yhat. (The if e(sample) code prevents predicted values from being calculated for cases that may be excluded from the regression model). The ttest command is the easiest way to get the difference in the means of the predicted values for the two groups (but you can ignore the p-values). The mean predicted value for those in the labor force was .680, while the mean predicted value for those not in the labor force was .422. The difference of .258 is the Tjur R2. By comparison, the Cox-Snell R2 is .248 and the McFadden R2 is .208. The corrected Cox-Snell is .332.
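If you’d rather not run a t-test just to get two means, the same number can be computed directly from the variables created above (a minimal sketch):

* Mean predicted probability in each group, then the difference (Tjur's R2)
quietly summarize yhat if inlf == 1
scalar m1 = r(mean)
quietly summarize yhat if inlf == 0
scalar m0 = r(mean)
display "Tjur R2 = " m1 - m0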

Here’s the equivalent SAS code:

proc logistic data=my.mroz;
  model inlf(desc) = kidslt6 age educ huswage city exper;
  output out=a pred=yhat;
run;

proc ttest data=a;
  class inlf;
  var yhat;
run;

One possible objection to the Tjur R2 is that, unlike Cox-Snell and McFadden, it’s not based on the quantity being maximized, namely, the likelihood function.* As a result, it’s possible that adding a variable to the model could reduce the Tjur R2. But Kvalseth (1985) argued that it’s actually preferable that R2 not be based on a particular estimation method. In that way, it can legitimately be used to compare predictive power for models that generate their predictions using very different methods. For example, one might want to compare predictions based on logistic regression with those based on a classification tree method.

Another potential complaint is that the Tjur R2 cannot be easily generalized to ordinal or nominal logistic regression. For McFadden and Cox-Snell, the generalization is straightforward. 

If you want to learn more about logistic regression, check out my book Logistic Regression Using SAS: Theory and Application, Second Edition (2012), or try my seminars on Logistic Regression Using SAS or Logistic Regression Using Stata.

* Conjecture: I suspect that the Tjur R2 is maximized when logistic regression coefficients are estimated by the linear discriminant function method. I encourage any interested readers to try to prove (or disprove) that. (For background on the relationship between discriminant analysis and logistic regression, see Press and Wilson (1978).)

References:

Cragg, J.G. and R.S. Uhler (1970) “The demand for automobiles.” The Canadian Journal of Economics 3: 386-406.

Cox, D.R. and E.J. Snell (1989) Analysis of Binary Data. Second Edition. Chapman & Hall.

Kvalseth, T.O. (1985) “Cautionary note about R2.” The American Statistician 39: 279-285.

McFadden, D.  (1974) “Conditional logit analysis of qualitative choice behavior.”  Pp. 105-142 in P. Zarembka (ed.), Frontiers in Econometrics.  Academic Press.

Nagelkerke, N.J.D. (1991) “A note on a general definition of the coefficient of determination.” Biometrika 78: 691-692.

Maddala, G.S. (1983) Limited Dependent and Qualitative Variables in Econometrics. Cambridge University Press.

Menard, S. (2000) “Coefficients of determination for multiple logistic regression analysis.” The American Statistician 54: 17-24.

Mittlbock, M. and M. Schemper (1996) “Explained variation in logistic regression.” Statistics in Medicine 15: 1987-1997.

Mroz, T.A. (1987) “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.” Econometrica 55: 765-799.

Press, S.J. and S. Wilson (1978) “Choosing between logistic regression and discriminant analysis.” Journal of the American Statistical Association 73: 699-705.

Tjur, T. (2009) “Coefficients of determination in logistic regression models—A new proposal: The coefficient of discrimination.” The American Statistician 63: 366-372.


21 Responses


  2. Zuber D. Mulla says:

    Lucid advice, and useful…especially helpful for students transitioning from linear regression to logistic regression.

  3. Omer Barak says:

    Great post!
    I believe you accidentally “flipped” the Cox & Snell R^2…
    It should be [1 – (L0 / LM)]^2/n and not [1 – (LM / L0)]^2/n.
    (that’s how it’s written in Nagelkerke’s paper).

  4. Richard Williams says:

    Very interesting. On a trivial point, I believe the Tjur stat is the absolute value of the difference, as the difference comes up negative, at least in this example.

  5. Lixi says:

    Great post! I actually have a question about the model form of hazard analysis. I’ve been using the book “Survival Analysis Using SAS” (very useful!) and it seems all the models in the book use an exponential form: h=exp(a+b*X1+c*X2), where h is the hazard and X1, X2 are the independent variables. I noticed, though, that when I use a power form, say something like h=X1^(a+b*X2), changing the unit of X1 changes the significance test results, and even the AIC. I was wondering if you ever encountered this, or what your suggestions are on this. I apologize if this is not the appropriate place to ask, but I’m really curious. Thanks!

    • Paul Allison says:

      I’m not surprised that changing the units of x1 would substantially change the results. Unless x1 has a coefficient, there is nothing to absorb changes in units. But why would you even consider a model like this?

      • Lixi says:

        Thanks for the reply! This model was used by a former graduate student in our lab 10 years ago. He recalled this power model as the most “efficient” and “easiest” form for his data (he couldn’t recall many details). What’s interesting is that when I run a power form and an exponential form on our data, the former always seems to come up with a smaller AIC. The significance test results would be different depending on the covariates included. So we were wondering if there is a reason to prefer one to the other, or maybe it’s data specific?
        Thanks!

      • Lixi says:

        And a small correction about the models, the power form should be h=a*X1^(b+c*X2), is this “a” what you meant by “X1 has a coefficient”? But I think it didn’t absorb changes in units. A parallel exponential form we compare result with would be: h=exp(a+b*X1+c*X1*X2). Sorry about the confusion.

  6. James Swartz says:

    I like this test postestimation for a regular binary logistic. However, it seems not to work when running a Firth logistic regression and produces values that are larger than 1. A quick check of the predicted values shows why: the predicted scores are no longer bounded by 1 after running these models.

    Is there any way to get the equivalent of Tjur’s coefficient of discrimination after running a Firth logistic regression in Stata?

    • Paul Allison says:

      I think the problem is that when you use the firthlogit command in Stata, the predict command does not produce predicted probabilities. Instead, it gives the linear predictor or, equivalently, the log-odds. To get probabilities, you need to do the following:

      firthlogit y x1 x2 x3
      predict logodds
      gen predprob=1/(1+exp(-logodds))

      Then calculate the Tjur R2 using these predicted probabilities.
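      For example, mirroring the calculation from the post (a sketch, using the placeholder variable names above):

      quietly summarize predprob if y == 1
      scalar m1 = r(mean)
      quietly summarize predprob if y == 0
      scalar m0 = r(mean)
      display "Tjur R2 = " m1 - m0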


  8. Robert Feyerharm says:

    Thanks for posting about Tjur’s R2 – how is Tjur commonly pronounced?

  9. Jose says:

    I am trying to use Tjur’s R2 after a Firth regression, but am getting very strange outputs; e.g., the mroz data above gives an R2 = 1.3.

    This must be due to the penalized likelihood, but can it be adjusted by weighting, or?

    • Paul Allison says:

      What software are you using? As long as the predicted values are between 0 and 1, calculation of the Tjur R2 shouldn’t be a problem. The Firth method should produce predicted values between 0 and 1.

  10. Marge says:

    I understand that these pseudo R-squares should be interpreted NOT as a proportion of variance explained as in OLS multiple regression but rather (and this makes sense) as small, medium, or large effects. The problem is that I don’t see any general guidelines as to what values of a ‘pseudo’ R-square would constitute a ‘small’, ‘medium’, or ‘large’ effect. Obviously much depends on the data set, but do you have any general suggestions?

  11. Nabila says:

    Hi Paul, I have a logistic regression model for which I was looking at goodness-of-fit tests. The Hosmer and Lemeshow test is significant for my data, as the number of rows is more than 10,000. The Nagelkerke R2 value for my model is about 0.32, but the percentage concordance (as reported in SAS) is 79%. Also, in the classification table, the percentage correctly classified by the model is 75%. I tested this out on a random sample and got 76% of cases correctly classified. Can you suggest some other measures by which I can validate my model and check its goodness of fit? Thanks, N
