Logistic Regression for Rare Events

February 13, 2012 By Paul Allison

Prompted by a 2001 article by King and Zeng, many researchers worry about whether they can legitimately use conventional logistic regression for data in which events are rare. Although King and Zeng accurately described the problem and proposed an appropriate solution, there are still a lot of misconceptions about this issue.

The problem is not specifically the rarity of events, but rather the possibility of a small number of cases on the rarer of the two outcomes.  If you have a sample size of 1000 but only 20 events, you have a problem. If you have a sample size of 10,000 with 200 events, you may be OK. If your sample has 100,000 cases with 2000 events, you’re golden.

There’s nothing wrong with the logistic model in such cases. The problem is that maximum likelihood estimation of the logistic model is well-known to suffer from small-sample bias. And the degree of bias is strongly dependent on the number of cases in the less frequent of the two categories. So even with a sample size of 100,000, if there are only 20 events in the sample, you may have substantial bias.

What’s the solution?  King and Zeng proposed an alternative estimation method to reduce the bias. Their method is very similar to another method, known as penalized likelihood, that is more widely available in commercial software. Also called the Firth method, after its inventor, penalized likelihood is a general approach to reducing small-sample bias in maximum likelihood estimation. In the case of logistic regression, penalized likelihood also has the attraction of producing finite, consistent estimates of regression parameters when the maximum likelihood estimates do not even exist because of complete or quasi-complete separation.
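
To make this concrete, here is a minimal sketch of how the Firth correction might be requested in SAS; the data set mydata and the variables y and x1-x3 are hypothetical placeholders for your own data.

    * Penalized (Firth) maximum likelihood for a binary outcome;
    * mydata, y, and x1-x3 are hypothetical placeholders;
    proc logistic data=mydata;
      model y(event='1') = x1 x2 x3 / firth;
    run;

Dropping the FIRTH option gives the conventional maximum likelihood fit, so the two sets of estimates are easy to compare side by side.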

Unlike exact logistic regression (another estimation method for small samples but one that can be very computationally intensive), penalized likelihood takes almost no additional computing time compared to conventional maximum likelihood. In fact, a case could be made for always using penalized likelihood rather than conventional maximum likelihood for logistic regression, regardless of the sample size. Does anyone have a counter-argument?  If so, I’d like to hear it.
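
For comparison, here is a rough sketch of how exact logistic regression can be requested in SAS, again with hypothetical data set and variable names; with many predictors or a large sample, the EXACT computations can take a very long time.

    * Exact logistic regression (slow except for small problems);
    proc logistic data=mydata;
      model y(event='1') = x1 x2;
      exact x1 x2 / estimate=both;
    run;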

You can learn more about penalized likelihood in my seminar Logistic Regression Using SAS.


Reference:
Gary King and Langche Zeng. “Logistic Regression in Rare Events Data.” Political Analysis 9 (2001): 137-163.


93 Responses

  1. Dr. Md. Zakir Hossain says:

    I am thinking of using Poisson regression in cases where the event is rare, since p (the probability of success) is very small and n (the sample size) is large.

  2. Paul Allison says:

    This has no advantage over logistic regression. There’s still small sample bias if the number of events is small. Better to use exact logistic regression (if computationally practical) or the Firth method.

    • Rose Ignacio says:

      Can you please explain further why you say Poisson regression has no advantage over logistic regression when we have rare events? Thanks.

      • Paul Allison says:

        When events are rare, the Poisson distribution provides a good approximation to the binomial distribution. But it’s still just an approximation, so it’s better to go with the binomial distribution, which is the basis for logistic regression.

  3. Partha says:

    Is this the case with PHREG as well? If you have 50 events for 2,000 observations, is the FIRTH option appropriate if your goal is not only to model the likelihood of the event but also the median time to event?

    • Paul Allison says:

      The Firth method can be helpful in reducing small-sample bias in Cox regression, which can arise when the number of events is small. The Firth method can also be helpful with convergence failures in Cox regression, although these are less common than in logistic regression.
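
      For illustration, a minimal sketch of how the corresponding option might be requested in PROC PHREG, with hypothetical data set and variable names (the FIRTH option requires a reasonably recent release of SAS/STAT):

        proc phreg data=mydata;
          model time*death(0) = x1 x2 / firth;
        run;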

      • Tarana Lucky says:

        I am interested in determining which factors are significantly associated with an “outcome,” which is a binary variable in my sample. My sample size from a cross-sectional survey is 20,000, and the number of respondents with the “outcome” present is 70. Which method would be appropriate, multiple logistic regression or Poisson regression?

        Thanks.

  4. harry says:

    “Does anyone have a counter-argument? If so, I’d like to hear it.”
    I usually default to using Firth’s method, but in some cases the true parameter really is infinite. If the response variable is presence of testicular cancer and one of the covariates is sex, for example. In that case, it’s obvious that sex should not be in the model, but in other cases it might not be so obvious, or the model might be getting fit as part of an automated process.

  5. Partha says:

    On a different note, I have read in Paul’s book that when there is a proportionality violation, creating a time-varying covariate by interacting the main predictor with time and testing its significance is both the diagnosis and the cure.

    So, if the IV is still significant when the IV*duration interaction is also significant, are we OK to interpret the effect?

    How does whether the event is rare or not affect the value of the above procedure?

    • Paul Allison says:

      Yes, if the IV*duration is significant, you can go ahead and interpret the “effect,” which will vary with time. The rarity of the event reduces the power of this test.

  6. Georg Heinze says:

    I fully agree with Paul Allison. We have done extensive simulation studies with small samples, comparing the Firth method with ordinary maximum likelihood estimation. Regarding point estimates, the Firth method was always superior to ML. Furthermore, it turned out that confidence intervals based on the profile penalized likelihood were more reliable in terms of coverage probability than those based on standard errors. Profile penalized likelihood confidence intervals are available, e.g., in SAS/PROC LOGISTIC and in the R logistf package.
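
    As a minimal sketch of what this looks like in SAS PROC LOGISTIC (hypothetical data set and variables), the Firth estimates and profile penalized likelihood confidence intervals can be requested together:

      proc logistic data=mydata;
        model y(event='1') = x1 x2 / firth clparm=pl;
      run;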

  7. elham says:

    Hi,
    I am a PhD student in biostatistics. I have a data set with approximately 26,000 cases but only 110 events. I used the weighting method for rare events from the Gary King article. My goal was to estimate odds ratios in a logistic regression; unfortunately, the standard errors and confidence intervals are large, and there is little difference from ordinary logistic regression. I don’t know why. What do you think? Can I use penalized likelihood?

    • Paul Allison says:

      My guess is that penalized likelihood will give you very similar results. 110 events is enough so that small sample bias is not likely to be a big factor–unless you have lots of predictors, say, more than 20. But the effective sample size here is a lot closer to 110 than it is to 26,000. So you may simply not have enough events to get reliable estimates of the odds ratios. There’s no technical fix for that.

      • kiran kumar says:

        Paul,

        Please clarify this for me. I have a sample of 16,000 observations with an equal number of goods and bads. Is that a good way to build the model, or should I reduce the number of bads?

  8. Adwait says:

    Hi Dr. Allison,

    If the event I am analyzing is extremely rare (1 in 1000) but the available sample is large (5 million) such that there are 5000 events in the sample, would logistic regression be appropriate? There are about 15-20 independent variables that are of interest to us in understanding the event. If an even larger sample would be needed, how much larger should it be at a minimum?

    If logistic regression is not suitable, what are our options to model such an event?

    Thanks,
    Adwait

    • Paul Allison says:

      Yes, logistic regression should be fine in this situation. Again, what matters is the number of the rarer event, not the proportion.

  9. Kar says:

    Hi Dr. Allison,

    I have a small data set (100 patients), with only 25 events. Because the dataset is small, I am able to do an exact logistic regression. A few questions…

    1. Is there a variable limit for inclusion in my model? Does the 10:1 rule that is often suggested still apply?
    2. Is there a “number” of events below which conventional logistic regression is not recommended, e.g., 20?

    Thanks and take care.

    • Paul Allison says:

      1. I’d probably be comfortable with the more “liberal” rule of thumb of 5 events per predictor. Thus, no more than 5 predictors in your regression.
      2. No there’s no lower limit, but I would insist on exact logistic regression for accurate p-values.

      • SAM2013 says:

        Dr. Allison,

        I benefited a lot from your explanation of exact logistic regression, and I read your reply to this comment saying that you would relax the criterion to only 5 events per predictor instead of 10. I am in this situation right now and badly need your help. I will have to be able to defend that choice, and I want to know if there is evidence behind the relaxed 5-events-per-predictor rule with exact regression.

        Thanks a lot.

        • Paul Allison says:

          Below are two references that you might find helpful. One argues for relaxing the 10 events per predictor rule, while the other claims that even more events may be needed. Both papers focus on conventional ML methods rather than exact logistic regression.

          Vittinghoff, E. and C.E. McCulloch (2006) “Relaxing the rule of ten events per variable in logistic and Cox regression.” American Journal of Epidemiology 165: 710-718.

          Courvoisier, D.S., C. Combescure, T. Agoritsas, A. Gayet-Ageron and T.V. Perneger (2011) “Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure.” Journal of Clinical Epidemiology 64: 993-1000.

      • SAM2013 says:

        Hello again,

        I also wanted to confirm this with you: if I have gender as a predictor (male, female), is it counted as TWO variables rather than one?

        Thanks.

        • Paul Allison says:

          Gender is only one variable.

          • SAM2013 says:

            Thank you very much for your help. I guess I gave you a bad example for my question. I wanted to know: if a categorical variable has more than two levels, would it still be counted as one variable for the sake of the rule we are discussing?

            Also, do we have to stick to the 5 events per predictor if we use Firth, or can we violate the rule completely, and if it is OK to violate it, do I have to mention a limitation about that?

            Sorry for the many questions.

            Thanks

          • Paul Allison says:

            What matters is the number of coefficients. So a categorical variable with five categories would have four coefficients. Although I’m not aware of any studies on the matter, my guess is that the same rule of thumb (of 5 or 10 events per coefficient) would apply to the Firth method. Keep in mind, however, that this is only the roughest rule of thumb. Its purpose is to ensure that the asymptotic approximations (consistency, efficiency, normality) aren’t too bad. But it is not sufficient to determine whether the study has adequate power to test the hypotheses of interest.

  10. Joe says:

    Hi Dr. Allison,

    You mention in your original post that if a sample has 100,000 cases with 2,000 events, you’re golden. My question is this: from that group of 100,000 cases with 2,000 or so events, what is the appropriate sample size for analysis? I am working with a population of about 100,000 cases with 4,500 events; I want to select a random sample from this, but don’t want the sample to be too small (I want to ensure there are enough events in the analysis). A second follow-up question: is it OK for my cutoff value in logistic regression to be so low (around 0.04 or so)?
    Thanks so much for any help you can provide!
    Joe

    • Paul Allison says:

      My question is, do you really need to sample? Nowadays, most software packages can easily handle 100,000 cases for logistic regression. If you definitely want to sample, I would take all 4500 cases with events. Then take a simple random sample of the non-events. The more the better, but at least 4500. This kind of disproportionate stratified sampling on the dependent variable is perfectly OK for logistic regression (see Ch. 3 of my book Logistic Regression Using SAS). And there’s no problem with only .04 of the original sample having events. As I said in the blog post, what matters is the number of events, not the proportion.
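
      If you do decide to sample, one way to implement it in SAS is sketched below; the data set names, the outcome variable y, and the sample size are hypothetical placeholders.

        * Keep all events and draw a simple random sample of non-events;
        proc surveyselect data=full(where=(y=0)) out=nonevents
                          method=srs sampsize=10000 seed=20120213;
        run;

        data analysis;
          set full(where=(y=1)) nonevents;
        run;

      Under this kind of sampling on the dependent variable, the slope coefficients are still estimated as usual; only the intercept (and hence the predicted probabilities) needs adjustment, as the PRIOREVENT option discussed further down in this thread illustrates.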

  11. Mohammed Shamsul Karim says:

    Dear Paul

    I am using a data set of 86,000 observations to study business start-ups. Most of the responses are dichotomous. The business start-up rate, which is my dependent variable, is 5%. I used logistic regression, and the results show that all 10 independent variables are highly significant. I tried rare-events logistic regression and got the same result. People are complaining about the highly significant results and saying they may be biased. Would you please advise me?

    Regards

  12. Athanasios Theofilatos says:

    I am going to analyze a situation where there are 97 non-events and only 3 events. I will try rare-events logistic regression as well as Bayesian logistic regression.

    • Paul Allison says:

      With only three events, no technique is going to be very reliable. I would probably focus on exact logistic regression.

  13. Kara says:

    I am looking at comparing trends in prescription rates over time from a population health database. The events are in the range of 1,500 per 100,000 people, plus or minus, in each of 5 years.
    The Cochran-Armitage test for trend and logistic regression always seem to be significant, even though the event rate is only going from 1.65 to 1.53. Is there a better test I should be performing, or is this just due to large population numbers yielding high power?

    thank you,

  14. Vaidy says:

    Dear Dr. Allison,

    I have unbalanced panel data on low-birth-weight kids. I am interested in evaluating the probability of hospital admission (per 6-month period) between 1 and 5 years of age. Birth weight categories are my main predictor variables of interest, but I would also want to account for their time-varying effects by interacting BW categories with age-period. The sample size of the cohort at age 1 is ~51,000, but the sample size is reduced to 19,000 by age 5. Hospital admissions in the sample at years 1 and 5 are, respectively, 2,246 and 127. Are there issues in using the logistic procedure in the context of unbalanced panel data such as mine? Please provide your thoughts as they may apply to 1) pooled logistic regression using cluster-robust SEs and 2) a fixed/random effects panel approach. Many thanks in advance.

    Best,

    Vaidy

    • Paul Allison says:

      Regarding the unbalanced sample, a lot depends on why it’s unbalanced. If it’s simply because of the study design (as I suspect), I wouldn’t worry about it. But if it’s because of drop out, then you have to worry about the data not being missing completely at random. If that’s the case, maximum likelihood methods (like random effects models) have the advantage over simply using robust standard errors. Because FE models are also ML estimates, they should have good properties also.

      • Vaidy says:

        Dr. Allison,

        Thanks for your response. I guess I am saying I have two different issues here with my unbalanced panel: 1) the attrition issue that you rightly brought up; 2) I am concerned about the incidental parameters problem when using fixed/random effects logistic regression with heavily attrited data. I ran some probit models to predict attrition, and it appears that attrition in my data is mostly random. Is the second issue, regarding the incidental parameters problem, really of concern? Each panel in my data is composed of a minimum of two waves. Thanks.

        • Paul Allison says:

          First, it’s not possible to tell whether your attrition satisfies the missing at random condition. MAR requires that the probability of a datum being missing does not depend on the value of that datum. But if you don’t observe it, there’s no way to tell. Second, incidental parameters are not a problem if you estimate the fixed effects model by way of conditional likelihood.

          • Vaidy says:

            Thanks for clarifying the incidental parameters problem. I get your point about the criterion for MAR, that the missingness should not depend on the value of the datum. Key characteristics that could affect attrition are not observed in my data (e.g., SES, maternal characteristics, family income, etc.). If there is no way to determine MAR, would it be fine to use a weighting procedure based on the theory of selection on observables? For example, Fitzgerald and Moffitt (1998) developed an indirect method to test for attrition bias in panel data by using lagged outcomes to predict non-attrition. They call the lagged outcomes auxiliary variables. I ran probit regressions using different sets of lagged outcomes (such as lagged costs, hospitalization status, disability status, etc.), and none of the models predicted more than 10% of the variation in non-attrition. This essentially means that attrition is probably not affected by observables. But should I still weight my observations in the panel regressions using the predicted probabilities of non-attrition from the probit models?

            Of course, I understand that this still does not address selection on unobservables [and hence your comment about I cannot say that data is missing at random].

            Thanks,

            Vaidy

          • Paul Allison says:

            MAR allows for selection on observables. And ML estimates of fixed and random effects automatically adjust for selection on observables, as long as those observables are among the variables in the model. So there’s no need to weight.

  15. Kyle says:

    Dr. Allison,

    I’m wondering what your thoughts are on this off-the-cuff idea: Say I have 1,000 samples and only 50 cases. What if I sample 40 cases and 40 controls and fit a logistic regression, either with a small number of predictors or with some penalized regression. Then I predict the other 10 cases with my coefficients, save the MSE, and repeat the sampling many, many times (say, B). Then I build up an estimate of the ‘true’ coefficients based on a weighted average of the B inverse MSEs and beta vectors. OK idea, or hugely biased?

    • Paul Allison says:

      I don’t see what this buys you beyond what you get from just doing the single logistic regression on the sample of 1000 using the Firth method.

  16. Mathews says:

    Hi Dr.Allison ,

    In the case of rare-event logistic regressions (sub-1%), would a pseudo R2 (Cox and Snell, etc.) be a reliable indicator of model fit, since its upper bound depends on the overall probability of occurrence of the event itself? Would a low R2 still represent a poor model? I’m assuming the confusion matrix may no longer be a great indicator of model accuracy either.

    Thanks

    • Paul Allison says:

      McFadden’s R2 is probably more useful in such situations than the Cox-Snell R2. But I doubt that either is very informative. I certainly wouldn’t reject a model in such situations just because the R2 is low.

  17. Yotam says:

    Dear Dr. Allison,

    I am analyzing the binary decisions of 500,000 individuals across two periods (so one million observations total). There were 2,500 successes in the first period and 6,000 in the second. I estimate the effects of 20 predictors per period (40 total). For some reason, both logit and probit models give me null effects for variables that are significant under a linear probability model.

    Any thoughts on why this might be the case? Thanks very much.

    • Paul Allison says:

      Good question. Maybe the LPM is reporting inaccurate standard errors. Try estimating the LPM with robust standard errors.

      • Yotam says:

        Thanks so much for the suggestion. I did use robust standard errors (the LPM requires them, as it fails the homoskedasticity assumption by construction), and the variables are still significant under the LPM. I recall reading somewhere that the LPM and logit/probit may give different estimates when modeling rare events, but I cannot find a reference supporting this or work out for myself why this might be the case.

        • Paul Allison says:

          It does seem plausible that results from LPM and logit would be most divergent when the overall proportion of cases is near 1 or 0, because that’s where there should be most discrepancy between a straight line and the logistic curve. I have another suggestion: check for multicollinearity in the variable(s) that are switching significance. Seemingly minor changes in specification can have major consequences when there is near-extreme collinearity.
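
          One quick way to screen for that is to compute variance inflation factors from a linear (probability model) version of the regression; the sketch below uses hypothetical variable names, with the interaction terms assumed to be pre-computed (e.g., x1_post = x1*post).

            * Collinearity diagnostics via the linear probability model;
            proc reg data=mydata;
              model y = x1 x2 x1_post x2_post post / vif tol;
            run;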

          • Yotam says:

            Thanks, and so sorry for the late reply. I think you are right that collinearity may be responsible for the difference. In my analysis, I aim to find the incremental effect of several variables in the latter period (post-treatment) above and beyond their effects in the earlier period (pre-treatment). Every variable thus enters my model twice, once alone and once interacted with a period indicator. The variables are, of course, highly correlated with themselves interacted with the indicator. Thanks again!

  18. Hongmei says:

    Paul, I saw your post while searching for more information related to rare-events logistic regression. Thank you for the explanation, but why not ZIP (zero-inflated Poisson) regression?

  19. John Burton says:

    Dear Dr Allison,

    Is there a threshold one should adhere to for an independent variable to be used in logistic regression, in terms of the ratio of the two categories within an independent categorical variable? For example, if I am trying to assess, in a sample of 100 subjects, whether gender is a predictor of getting an infection (coded as 1), but 98 subjects are male and only 2 are female, will the results be reliable given such disparity between the two categories of the independent categorical variable? [The event-rate-to-variable ratio is set flexibly at 5.]

    thank you for your advice.

    regards

    John

    • Paul Allison says:

      With only 2 females, you will certainly not be able to get reliable estimates of sex differences. That should be reflected in the standard error for your sex coefficient.

  20. F. says:

    Dear Dr. Allison,

    I have a slightly different problem, but maybe you have an idea. I use a multinomial logit model. One value of the dependent variable has 100 events, another has 4,000 events. The sample size is 1,900,000. I am thinking the 100 events could be too few.

    Thank you!

    • Paul Allison says:

      100 might be OK if you don’t have a large number of predictors. But don’t make this category the reference category.

      • F. says:

        Thank you,

        I am using about ten predictors; would you consider this a low number in this case?

        In general, is there an easy-to-implement way to deal with rare events in a multinomial logit model?

  21. Juliette C says:

    Dear Dr. Allison,

    I have a population of 810,000 cases with 500 events, and I would like to use a logit model with about 10 predictors. If I ran a logistic regression, would it give good results for the coefficient estimates (especially for the constant term)?

    Thank you!

    • Paul Allison says:

      I see no problem with this. You can judge the quality of the constant term estimate by its confidence interval.

      • Juliette C says:

        I don’t understand, because I read in the article https://files.nyu.edu/mrg217/public/binaryresponse.pdf (page 38, discussing King and Zeng’s article) that “logit coefficients can lead us to underestimate the probability of an event even with sample sizes in the thousands when we have rare events data.” In fact, they explain that the constant term is affected (biased downward), but I think they also talk about biased coefficients (page 42).

        Also, one can read a lot about prior correction for rare events in samples. I am wondering what the point of this correction is. Why should we use a sample rather than the whole available population if the estimates are biased in both cases?

        • Paul Allison says:

          As I said in my post, what matters for bias is not the rarity of events (in terms of a small proportion) but the number of events that are actually observed. If there is concern about bias, the Firth correction is very useful and readily available. I do not believe that undersampling the non-events is helpful in this situation.

  22. Karen says:

    Dr. Allison–
    Thank you very much for this helpful post. I am analyzing survey data using SAS. I am looking at sexual violence, and there are only 144 events. Although the overall sample is quite large (over 18,000), due to skip patterns in the survey I am looking at a subpopulation of only sexually active males (the only ones in the survey asked the questions of interest). The standard errors for the overall sample look excellent, but when applying subpopulation analysis the standard errors are large. Do you have any suggestions to address this? I believe I can’t use the Firth method in this case, because I use SAS and it doesn’t seem to be available for PROC SURVEYLOGISTIC.

    Thank you.
    –Karen

  23. Rich says:

    According to the Stata manual on the complementary log-log model, “Typically, this model is used when the positive (or negative) outcome is rare,” but there isn’t much explanation provided.

    I tried looking up a few papers and textbooks about clog-log but most simply talk about the asymmetry property.

    Can we use the complementary log-log model for a rare-event binary outcome? Which is preferred?

  24. M says:

    Dear Dr. Allison,

    I have a sample with 5 events out of a total of 1,500 cases. Is it possible to perform logistic regression with this sample (I have 5 predictors)? Do you know if the Firth method is available in SPSS?

    Thank you.

    • Paul Allison says:

      Not much you can do with just five events. Even a single predictor could be problematic. I’d go with exact logistic regression, not Firth. As far as I know, Firth is not available in SPSS.

  25. Scott says:

    (Correction 2 – I sincerely apologize for my errors; the following is a correct and complete version of my question.)
    Dr. Allison,
    I have a sample of 7,108 with 96 events. I would like to use logistic regression and seem to be OK with the standard errors. However, when analyzing standardized residuals for outliers, all but 5 of the 96 cases positive for the event have standardized residuals > 1.96. I have a few questions:
    1) Is 96 events sufficient for logistic regression?
    2) With 96 events, how many predictors would you recommend?
    3) Given that rare-events analysis is really an analysis of outliers, how do you deal with identifying outliers in such a case?
    Thank you.

    • Paul Allison says:

      1. Yes, 96 events is sufficient.
      2. I’d recommend no more than 10 predictors.
      3. I don’t think standardized residuals are very informative in a case like this.

  26. PN says:

    I have a data set of about 60,000 observations with 750 event cases and 5 predictor variables. When I run the logistic regression, all the predictors are significant, and the percentage of concordant pairs is about 80%. However, the overall model fit is not significant. Any suggestions for dealing with this?

    • Paul Allison says:

      It’s rather surprising that all 5 predictors would be significant (at what level?) but the overall model fit is not significant. Send me your output.

  27. John says:

    Hi Dr. Allison,

    You have mentioned that 2,000 events out of 100,000 is a good sample for logistic regression, which is a 98%-2% split. I have always been told that we should have an 80-20 or 70-30 split for logistic regression, and that if such a split is not there, we should reduce the data. For example, we should keep the 2,000 events, randomly select 8,000 non-event observations, and run the model on 10,000 records in place of 100,000. Please advise.

    • Paul Allison says:

      There is absolutely no requirement that there be an 80-20 split or better. And deleting cases to achieve that split is a waste of data.

  28. Bernhard Schmidt says:

    Dear Dr. Allison,

    I have data on 41 patients with 6 events (=death). I am studying the prognostic value of a diagnostic parameter (DP) (numerical) for the outcome (survival/death).
    In a logistic regression of outcome versus DP, DP was significant. However, I would like to clarify whether this prognostic value is independent of age and 3 other dichotomous parameters (gender, disease, surgery). In a multiple logistic regression, DP was the only significant parameter out of these 5. But I was told the events-to-number-of-parameters ratio should be at least 5, and therefore this result has no meaning. Is there any method that could help come closer to an answer? Or is it simply not enough data? (Unfortunately, a small population is a common problem in clinical studies.) Thank you very much for any suggestion.
    Bernhard

    • Paul Allison says:

      Try exact logistic regression, available in SAS, Stata, and some other packages. This is a conservative method, but it has no lower bound on the number of events. You may not have enough data to get reliable results, however.

  29. Kelly says:

    I have a rare predictor (n=41) and a rare outcome. Any guidelines on how many events are needed for the predictor? (Or, on the n in a given chi-square cell?)

    Thanks so much!

    • Paul Allison says:

      Check the 2 x 2 table and compute expected frequencies under the independence hypothesis. If they are all > 5 (the traditional rule of thumb) you should be fine.
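
      In SAS, for example, the observed and expected cell counts can be obtained like this (x and y here are the hypothetical binary predictor and outcome):

        proc freq data=mydata;
          tables x*y / chisq expected;
        run;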

  30. Alfonso says:

    Dear Colleagues, sorry to interrupt your discussion, but I need help from the experts.
    I am a young cardiologist studying the outcome of patients with coronary ectasia during acute myocardial infarction (a very rare condition). I have only 31 events (a combined outcome of death, revascularization, and myocardial infarction). After univariate analysis I selected 5 variables. Is it possible, in your opinion, to carry out a Cox regression analysis in this case? The EPV is only 31/5 = 6.2.
    Thanks

    • Paul Allison says:

      It’s probably worth doing, but you need to be very cautious about statistical inference. Your p-values (and confidence intervals) are likely to be only rough approximations. A more conservative approach would be to do exact logistic regression.

  31. Saurabh Tanwar says:

    Hi Dr. Allison,

    I am working on a rare-event model with a response rate of only 0.13% (300 events in a data sample of 200,000). I was reading through your comments above, and you have stressed that what matters is the number of the rarer event, not the proportion. Can we put a “minimum number of events” that the data must have for modeling?

    In my case, I am asking this because I have the option of adding more data to increase the number of events (however, the response rate will remain the same 0.13%). How many events will be sufficient?

    Also, what would be the best strategy here: stratified sampling or the Firth method?

    Thanks,
    Saurabh

    • Paul Allison says:

      A frequently mentioned but very rough rule of thumb is that you should have at least 10 events for each parameter estimated. The Firth method is usually good. Stratified sampling (taking all events and a simple random sample of the non-events) is good for reducing computation time when you have an extremely large data set. In that method, you want as many non-events as you can manage.

  32. Joy says:

    Hi Paul! I’ve been reading this thread, and I also encounter problems in modeling outcomes for rare events occurring at 10% in the population we’re studying. One option we tried, to capture the unique behavior, was to take equal samples of outcomes and non-outcomes, just to determine the behavior needed to predict such outcomes. But when we ran the logistic model, we did not apply any weights to make the results representative of the population. Is this OK? I’m really not that happy with the accuracy rate of the model: only 50% of those predicted to have the outcome actually had it. Is our problem just a function of the equal sampling proportions? And will the Firth method help to improve our model? Hope to get good insights/recommendations from you. Thanks!

    • Paul Allison says:

      Unless you’re working with very large data sets where computing time is an issue, there’s usually nothing to be gained by sampling to get equal fractions of events and non-events. And weighting such samples to match the population usually makes things worse by increasing the standard errors. As I tried to emphasize in the blog, what’s important is the NUMBER of rare events, not the fraction of rare events. If the number of rare events is substantial (relative to the number of predictors), the Firth method probably won’t help much.

      • Joy says:

        Hi, thank you so much for your response. We are indeed working with very large data, so we need to sample to make computing time more efficient. I understand that what matters is the number of rare events and not the fraction; that’s why we made sure we have a readable sample of the events. But I feel that the problem with the accuracy of predicting the event is due to the equal numbers of events and non-events used in the model. Is this valid? And yes, applying weights did no good; it made the model results even worse. For the model built on my base, should I just use random sampling of my entire population and make sure that I have a readable base of my events?

        • Paul Allison says:

          When sampling rare events from a large data base, you get the best estimates by taking all of the events and a random sample of the non-events. The number of non-events should be at least equal to the number of events, but the more non-events you can afford to include, the better. When generating predicted probabilities, however, you should adjust for the disproportionate sampling. In SAS, this can be done using the PRIOREVENT option on the SCORE statement.
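
          A minimal sketch of that adjustment, assuming hypothetical data set names and a true population event proportion of 0.001:

            proc logistic data=sampled;
              model y(event='1') = x1 x2;
              score data=toscore out=preds priorevent=0.001;
            run;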

  33. J says:

    Since it sounds like the bias relates to maximum likelihood estimation, would Bayesian MCMC estimation methods also be biased?

    • Paul Allison says:

      Good question, but I do not know the answer.

      • Sam, applied med stats. says:

        Is this a relevant article?

        Mehta, Cyrus R., Nitin R. Patel, and Pralay Senchaudhuri. “Efficient Monte Carlo methods for conditional logistic regression.” Journal of The American Statistical Association 95, no. 449 (2000): 99-108.

        • Paul Allison says:

          This article is about computational methods for doing conditional logistic regression. It’s not really about rare events.

  34. Aaron says:

    Can you use model fit statistics from SAS, such as the AIC and -2 log likelihood, to compare models when penalized likelihood estimation with the Firth method is used?

    • Paul Allison says:

      I believe that the answer is yes, although I haven’t seen any literature that specifically addresses this issue.

  35. Paul B says:

    I know you’ve answered this many times above regarding logistic regression and discrete-time models — that if you have a huge number of observations, then it is best to take all of the events and a simple random sample of all of the non-events which is at least as large as the number of events. My question is: Does this advice apply also to continuous time models, specifically the Cox PH with time-varying covariates? I ask because I have a dataset with 2.8 million observations, 3,000 of which are events. Due to the many time-varying covariates and other fixed covariates (about 10 of each), we had to split the data into counting process format, so the 3,000 events have become 50,000 rows. Thus, our computing capabilities are such that taking a simple random sample from the non-events that is 15,000 (which become about 250,000 rows) and running these in PHREG with the events takes considerable computing time (it uses a combination of counting process format AND programming statements). Long story short, the question is – is 15,000 enough? And what corrections need to be made to the results when the model is based on a SRS of the non-events?

    • Paul Allison says:

      I think 15,000 is enough, but the methodology is more complex with Cox PH. There are two approaches: the nested case-control method and the case-cohort method. The nested case-control method requires a fairly complicated sampling design, but the analysis is (relatively) straightforward. Sampling is relatively easy with the case-cohort method, but the analysis is considerably more complex.

      • Paul B says:

        Thank you so much for the quick response! I really appreciate the guidance. I’ve just been doing some reading about both of these methods, and your concise summary of the advantages and disadvantages of each approach is absolutely right on. I wanted to share, in case others are interested, two good and easy-to-understand articles on these sampling methodologies which I found: “Comparison of nested case-control and survival analysis methodologies for analysis of time-dependent exposure” by Vidal Essebag et al., and “Analysis of Case-Cohort Designs” by William E. Barlow et al.
