## Logistic Regression for Rare Events

##### February 13, 2012 By Paul Allison

Prompted by a 2001 article by King and Zeng, many researchers worry about whether they can legitimately use conventional logistic regression for data in which events are rare. Although King and Zeng accurately described the problem and proposed an appropriate solution, there are still a lot of misconceptions about this issue.

The problem is not specifically the *rarity* of events, but rather the possibility of a small number of cases on the rarer of the two outcomes. If you have a sample size of 1000 but only 20 events, you have a problem. If you have a sample size of 10,000 with 200 events, you may be OK. If your sample has 100,000 cases with 2000 events, you’re golden.

There’s nothing wrong with the logistic *model* in such cases. The problem is that maximum likelihood estimation of the logistic model is well-known to suffer from small-sample bias. And the degree of bias is strongly dependent on the number of cases in the less frequent of the two categories. So even with a sample size of 100,000, if there are only 20 *events* in the sample, you may have substantial bias.

What’s the solution? King and Zeng proposed an alternative estimation method to reduce the bias. Their method is very similar to another method, known as penalized likelihood, that is more widely available in commercial software. Also called the Firth method, after its inventor, penalized likelihood is a general approach to reducing small-sample bias in maximum likelihood estimation. In the case of logistic regression, penalized likelihood also has the attraction of producing finite, consistent estimates of regression parameters when the maximum likelihood estimates do not even exist because of complete or quasi-complete separation.
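To make the penalized likelihood idea concrete, here is a minimal sketch in Python. It is not the King and Zeng estimator, and it is not the code used by any statistical package. It assumes the standard form of Firth's correction for logistic regression, in which the usual score X'(y - p) is replaced by X'(y - p + h(0.5 - p)), where h holds the diagonal elements of the hat matrix. The function names `solve` and `firth_logit` and the toy data are made up for illustration.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy, inputs untouched
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def firth_logit(X, y, max_iter=100, tol=1e-10):
    """Firth-penalized logistic regression fitted by Fisher scoring (illustrative only)."""
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        eta = [sum(b * v for b, v in zip(beta, row)) for row in X]
        p = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information X'WX
        info = [[sum(w[i] * X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
                for r in range(k)]
        # Hat-matrix diagonals: h_i = w_i * x_i' (X'WX)^{-1} x_i
        h = [w[i] * sum(xv * sv for xv, sv in zip(X[i], solve(info, X[i])))
             for i in range(n)]
        # Firth-modified score
        score = [sum((y[i] - p[i] + h[i] * (0.5 - p[i])) * X[i][r] for i in range(n))
                 for r in range(k)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Toy data with quasi-complete separation: no events at all when x = 0,
# so the ordinary ML estimate of the slope does not exist.
X = [[1.0, 0.0]] * 10 + [[1.0, 1.0]] * 10
y = [0] * 10 + [1] * 4 + [0] * 6
beta = firth_logit(X, y)
print([round(b, 4) for b in beta])
```

In this toy example, with a single binary predictor, the penalized estimates are finite and match what you would get by adding 0.5 to every cell of the 2 x 2 table before computing the log odds. That shortcut is specific to this simple case; with several predictors the iterative fit is needed.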

Unlike exact logistic regression (another estimation method for small samples but one that can be very computationally intensive), penalized likelihood takes almost no additional computing time compared to conventional maximum likelihood. In fact, a case could be made for *always* using penalized likelihood rather than conventional maximum likelihood for logistic regression, regardless of the sample size. Does anyone have a counter-argument? If so, I’d like to hear it.

You can learn more about penalized likelihood in my seminar *Logistic Regression Using SAS*.

Reference:

Gary King and Langche Zeng. “Logistic Regression in Rare Events Data.” *Political Analysis* 9 (2001): 137-163.

I am thinking of using Poisson regression in cases where the event is rare, since p (probability of success) is very small and n (sample size) is large.

This has no advantage over logistic regression. There’s still small sample bias if the number of events is small. Better to use exact logistic regression (if computationally practical) or the Firth method.

Can you please explain further why you say Poisson regression has no advantage over logistic regression when we have rare events? Thanks.

When events are rare, the Poisson distribution provides a good approximation to the binomial distribution. But it’s still just an approximation, so it’s better to go with the binomial distribution, which is the basis for logistic regression.
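As a rough numerical illustration of how close the approximation is, the sketch below compares the binomial distribution with the Poisson distribution having the same mean. The values n = 1000 and p = 0.02 are borrowed from the example in the post (20 events in 1000 cases); the function names are illustrative.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k events in n trials, computed on the log scale."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def poisson_pmf(k, lam):
    """Poisson probability of k events when the expected count is lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


n, p = 1000, 0.02  # 20 expected events, as in the post's first example
# Total variation distance between the two distributions (0 = identical, 1 = disjoint).
# The Poisson mass above n is ignored; it is negligible here.
tv = 0.5 * sum(abs(binom_pmf(k, n, p) - poisson_pmf(k, n * p)) for k in range(n + 1))
print(f"total variation distance: {tv:.4f}")
```

The distance is small but not zero, which is the sense in which the Poisson is "just an approximation" here.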

Is this the case with PHREG as well? If you have 50 events for 2000 observations, would the FIRTH option be the appropriate one if your goal is to model not only the likelihood but also the median time to event?

The Firth method can be helpful in reducing small-sample bias in Cox regression, which can arise when the number of events is small. The Firth method can also be helpful with convergence failures in Cox regression, although these are less common than in logistic regression.

I am interested in determining the significant factors associated with an “outcome”, which is a binary variable in my sample. My sample size from a cross-sectional survey is 20,000, and the number of respondents with the “outcome” present is 70. Which method would be appropriate, multiple logistic or Poisson regression?

Thanks.

There is no reason to consider Poisson regression. For logistic regression, I would use the Firth method.

“Does anyone have a counter-argument? If so, I’d like to hear it.”

I usually default to using Firth’s method, but in some cases the true parameter really is infinite. If the response variable is presence of testicular cancer and one of the covariates is sex, for example. In that case, it’s obvious that sex should not be in the model, but in other cases it might not be so obvious, or the model might be getting fit as part of an automated process.

On a different note, I have read in Paul’s book that when there is a proportionality violation, creating time-varying covariates with the main predictor, and testing for its significance is both the diagnosis and the cure.

So, if the IV is significant after the IV*duration is also significant, then, are we ok to interpret the effect?

How does whether the event is rare or not affect the value of the above procedure?

Yes, if the IV*duration is significant, you can go ahead and interpret the “effect” which will vary with time. The rarity of the event reduces the power of this test.

I fully agree with Paul Allison. We have done extensive simulation studies with small samples, comparing the Firth method with ordinary maximum likelihood estimation. Regarding point estimates, the Firth method was always superior to ML. Furthermore, it turned out that confidence intervals based on the profile penalized likelihood were more reliable in terms of coverage probability than those based on standard errors. Profile penalized likelihood confidence intervals are available, e.g., in SAS/PROC LOGISTIC and in the R logistf package.

Hi,

I am a PhD student in biostatistics. I have a data set with approximately 26,000 cases in which there are only 110 events. I used the weighting method for rare events from Gary King’s article. My goal was to estimate ORs in a logistic regression. Unfortunately, the standard errors and confidence intervals are large, and there is little difference from the usual logistic regression. I don’t know why. What do you think? Can I use penalized likelihood?

My guess is that penalized likelihood will give you very similar results. 110 events is enough so that small sample bias is not likely to be a big factor–unless you have lots of predictors, say, more than 20. But the effective sample size here is a lot closer to 110 than it is to 26,000. So you may simply not have enough events to get reliable estimates of the odds ratios. There’s no technical fix for that.
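One way to see why the effective sample size is closer to the number of events is the familiar large-sample standard error of a log odds ratio from a 2 x 2 table, the square root of the sum of the reciprocal cell counts. The sketch below uses a hypothetical split of the 110 events and the remaining non-events across a binary predictor; the cell counts are invented for illustration.

```python
import math


def se_log_odds_ratio(a, b, c, d):
    """Large-sample standard error of the log odds ratio for 2 x 2 cell counts a, b, c, d."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


# Hypothetical table: 110 events and 25,890 non-events, split evenly by a binary predictor.
se_full = se_log_odds_ratio(55, 55, 12_945, 12_945)
# Contribution of the two event cells alone, as if the non-events were unlimited.
se_events_only = math.sqrt(1 / 55 + 1 / 55)

print(f"SE with all 26,000 cases:   {se_full:.4f}")
print(f"SE from event cells alone: {se_events_only:.4f}")
```

Nearly all of the standard error comes from the two small event cells, so adding more non-events would barely change it.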

Paul,

Please clarify this for me. I have a sample of 16,000 observations with equal numbers of goods and bads. Is this a good way to build the model, or should I reduce the bads?

Don’t reduce the bads. There would be nothing to gain in doing that, and you want to use all the data you have.

Hi Dr. Allison,

If the event I am analyzing is extremely rare (1 in 1000) but the available sample is large (5 million) such that there are 5000 events in the sample, would logistic regression be appropriate? There are about 15-20 independent variables that are of interest to us in understanding the event. If an even larger sample would be needed, how much larger should it be at a minimum?

If logistic regression is not suitable, what are our options to model such an event?

Thanks,

Adwait

Yes, logistic regression should be fine in this situation. Again, what matters is the number of the rarer event, not the proportion.

Hi Dr. Allison,

I have a small data set (100 patients), with only 25 events. Because the dataset is small, I am able to do an exact logistic regression. A few questions…

1. Is there a variable limit for inclusion in my model? Does the 10:1 rule that is often suggested still apply?

2. Is there a “number” below which conventional logistic regression is not recommended…i.e. 20?

Thanks and take care.

1. I’d probably be comfortable with the more “liberal” rule of thumb of 5 events per predictor. Thus, no more than 5 predictors in your regression.

2. No there’s no lower limit, but I would insist on exact logistic regression for accurate p-values.

Dr. Allison,

I benefited a lot from your explanation of Exact logistic regression and I read your reply on this comment that you would relax the criteria to only 5 events per predictor instead of 10. I am in this situation right now and I badly need your help. I will have to be able to defend that and I wanna know if there is evidence behind the relaxed 5 events per predictor rule with exact regression?

Thanks a lot.

Below are two references that you might find helpful. One argues for relaxing the 10 events per predictor rule, while the other claims that even more events may be needed. Both papers focus on conventional ML methods rather than exact logistic regression.

Vittinghoff, E. and C.E. McCulloch (2007) “Relaxing the rule of ten events per variable in logistic and Cox regression.” American Journal of Epidemiology 165: 710-718.

Courvoisier, D.S., C. Combescure, T. Agoritsas, A. Gayet-Ageron and T.V. Perneger (2011) “Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure.” Journal of Clinical Epidemiology 64: 993-1000.

Hello again,

I also wanted to confirm with you that if I have gender as a predictor (male, female), this is counted as TWO variables and not one, right?

Thanks.

Gender is only one variable.

Thank you very much for your help. I guess I gave you a wrong example for my question. I wanted to know if a categorical variable has more than two levels, would it still be counted as one variable for the sake of the rule we are discussing?

Also, do we have to stick to the 5 events per predictor if we use Firth, or can we violate the rule completely, and if it is OK to violate it, do I have to mention a limitation about that?

Sorry for the many questions.

Thanks

What matters is the number of coefficients. So a categorical variable with 5 categories would have four coefficients. Although I’m not aware of any studies on the matter, my guess is that the same rule of thumb (of 5 or 10 events per coefficient) would apply to the Firth method. Keep in mind, however, that this is only the roughest rule of thumb. Its purpose is to ensure that the asymptotic approximations (consistency, efficiency, normality) aren’t too bad. But it is not sufficient to determine whether the study has adequate power to test the hypotheses of interest.
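The counting described in this reply can be sketched in a few lines of Python. The predictor names and the helper functions are hypothetical; the 25 events and 75 non-events come from the earlier question about 100 patients.

```python
def count_coefficients(predictors):
    """Count slope coefficients.

    `predictors` maps a name to its number of categories, or to None for a
    numeric predictor. A predictor with k categories contributes k - 1 coefficients.
    """
    return sum(1 if k is None else k - 1 for k in predictors.values())


def events_per_coefficient(n_events, n_nonevents, predictors):
    """Cases in the rarer outcome category divided by the number of coefficients."""
    return min(n_events, n_nonevents) / count_coefficients(predictors)


# Hypothetical model: one numeric predictor, one binary, one with 5 categories.
predictors = {"age": None, "sex": 2, "region": 5}
n_coef = count_coefficients(predictors)
epc = events_per_coefficient(25, 75, predictors)
print(n_coef, round(epc, 2))
```

Here the five-category variable alone uses four of the six coefficients, which puts the model slightly below the more liberal threshold of 5 events per coefficient.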

Hi Dr. Allison,

You mention in your original post that if a sample has 100,000 cases with 2,000 events, you’re golden. My question is this: from that group of 100,000 cases with 2,000 or so events, what is the appropriate sample size for analysis? I am working with a population of about 100,000 cases with 4,500 events; I want to select a random sample from this, but don’t want the sample to be too small (want to ensure there are enough events in the analysis). A second follow up question – is it ok for my cutoff value in logistic regression to be so low (around 0.04 or so?)

Thanks so much for any help you can provide!

Joe

My question is, do you really need to sample? Nowadays, most software packages can easily handle 100,000 cases for logistic regression. If you definitely want to sample, I would take all 4500 cases with events. Then take a simple random sample of the non-events. The more the better, but at least 4500. This kind of disproportionate stratified sampling on the dependent variable is perfectly OK for logistic regression (see Ch. 3 of my book Logistic Regression Using SAS). And there’s no problem with only .04 of the original sample having events. As I said in the blog post, what matters is the number of events, not the proportion.
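For readers who do need to sample, the scheme described in this reply (keep every event, then draw a simple random sample of the non-events) might look like the sketch below. The function name, the `nonevents_per_event` parameter, and the simulated outcome vector are assumptions made for illustration.

```python
import random


def sample_for_logit(outcomes, nonevents_per_event=1.0, seed=0):
    """Return sorted row indices: every event plus a simple random sample of non-events."""
    rng = random.Random(seed)
    events = [i for i, y in enumerate(outcomes) if y == 1]
    nonevents = [i for i, y in enumerate(outcomes) if y == 0]
    # At least as many non-events as events is the minimum suggested; more is better.
    n_keep = min(len(nonevents), round(nonevents_per_event * len(events)))
    return sorted(events + rng.sample(nonevents, n_keep))


# Simulated outcome vector matching the question: 4,500 events in 100,000 cases.
outcomes = [1] * 4_500 + [0] * 95_500
kept = sample_for_logit(outcomes, nonevents_per_event=3)
print(len(kept), sum(outcomes[i] for i in kept))
```

Slope estimates from a sample like this are still valid for the population, but the intercept, and therefore any predicted probability, reflects the inflated event fraction in the sample.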

Dear Paul

I am using a data set of 86,000 observations to study business start-up. Most of the responses are dichotomous. The business start-up rate, which is the dependent variable, is 5%. I used logistic regression, and the results show that all 10 independent variables are highly significant. I tried rare-events logistic regression and got the same result. People are complaining about the highly significant results and saying they may be biased. What would you suggest?

Regards

Given what you’ve told me, I think your critics are being unreasonable.

I am going to analyze a situation where there are 97 non-events and only 3 events… I will try rare-events logistic as well as Bayesian logistic…

With only three events, no technique is going to be very reliable. I would probably focus on exact logistic regression.

I am looking at comparing trends in prescription rates over time from a population health database. The events are in the range of 1500 per 100000 people +/- each of 5 years.

The Cochran-Armitage test for trend and logistic regression always seem to be significant even though the event rate is only going from 1.65 to 1.53. Is there a better test I should be performing, or is this just due to large population numbers yielding high power?

thank you,

It’s probably the high power.
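A back-of-the-envelope check supports this. The sketch below runs a simple two-proportion z-test, not the Cochran-Armitage trend test, on the first and last rates only. It assumes the rates are percentages (1.65% and 1.53%) and that each year has 100,000 people; both are readings of the question rather than facts stated in it.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-proportion z statistic and its two-sided normal p-value."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed: 100,000 people in each of the two years being compared.
z_large, p_large = two_proportion_z(0.0165, 100_000, 0.0153, 100_000)
# Same rates with a tenth of the population, for contrast.
z_small, p_small = two_proportion_z(0.0165, 10_000, 0.0153, 10_000)

print(f"n = 100,000 per year: z = {z_large:.2f}, p = {p_large:.3f}")
print(f"n =  10,000 per year: z = {z_small:.2f}, p = {p_small:.3f}")
```

Under these assumptions the same small difference is significant at the .05 level with 100,000 people per year and far from significant with 10,000, which is what high power looks like.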

Dear Dr. Allison,

I have unbalanced panel data on low birth weight kids. I am interested in evaluating the probability of hospital admissions (per 6 months) between 1 and 5 years of age. Birth weight categories are my main predictor variables of interest, but I would also like to account for their time-varying effects by interacting BW categories with age period. The sample size of the cohort at age 1 is ~51,000, but it is reduced to 19,000 by age 5. Hospital admissions in the sample at years 1 and 5 are 2,246 and 127, respectively. Are there issues in using the logistic procedure with unbalanced panel data such as mine? Please provide your thoughts as they may apply to 1) pooled logistic regression using cluster-robust SEs and 2) a fixed/random effects panel approach. Many thanks in advance.

Best,

Vaidy

Regarding the unbalanced sample, a lot depends on why it’s unbalanced. If it’s simply because of the study design (as I suspect), I wouldn’t worry about it. But if it’s because of drop out, then you have to worry about the data not being missing completely at random. If that’s the case, maximum likelihood methods (like random effects models) have the advantage over simply using robust standard errors. Because FE models are also ML estimates, they should have good properties also.

Dr. Allison,

Thanks for your response. I guess I am saying I have two different issues here with my unbalanced panel: 1) the attrition issue that you rightly brought up; 2) I am concerned about the incidental parameters problem in using fixed/random effects logistic regression with heavily attrited data. I ran some probit models to predict attrition, and it appears that attrition in my data is mostly random. Is the second issue, the incidental parameters problem, really of concern? Each panel in my data is composed of a minimum of two waves. Thanks.

First, it’s not possible to tell whether your attrition satisfies the missing at random condition. MAR requires that the probability of a datum being missing does not depend on the value of that datum. But if you don’t observe it, there’s no way to tell. Second, incidental parameters are not a problem if you estimate the fixed effects model by way of conditional likelihood.

Thanks for clarifying about the incidental parameters problem. I get your point about the criteria for MAR, that the missingness should not depend on the value of the datum. Key characteristics that could affect attrition are not observed in my data (e.g., SES, maternal characteristics, family income). If there is no way to determine MAR, will it be fine to use a weighting procedure based on the theory of selection on observables? For example, Fitzgerald and Moffitt (1998) developed an indirect method to test for attrition bias in panel data by using lagged outcomes to predict non-attrition. They call the lagged outcomes auxiliary variables. I ran probit regressions using different sets of lagged outcomes (such as lagged costs, hospitalization status, disability status, etc.), and none of the models predicted >10% of the variation in non-attrition. This essentially means that attrition is probably not affected by observables. But should I still weight my observations in the panel regressions using the predicted probabilities of non-attrition from the probit models?

Of course, I understand that this still does not address selection on unobservables [and hence your comment about I cannot say that data is missing at random].

Thanks,

Vaidy

MAR allows for selection on observables. And ML estimates of fixed and random effects automatically adjust for selection on observables, as long as those observables are among the variables in the model. So there’s no need to weight.

Dr. Allison,

I’m wondering your thoughts on this off-the-cuff idea: Say I have 1000 samples and only 50 cases. What if I sample 40 cases and 40 controls, and fit a logistic regression either with a small number of predictors or with some penalized regression. Then predict the other 10 cases with my coefficients, save the MSE, and repeat the sampling, many, many times (say, B). Then build up an estimate for the ‘true’ coefficients based on a weighted average of the B inverse MSEs and beta vectors. ok idea or hugely biased?

I don’t see what this buys you beyond what you get from just doing the single logistic regression on the sample of 1000 using the Firth method.

Hi Dr. Allison,

In the case of rare event logistic regressions (sub 1%), would the pseudo R2 (Cox and Snell, etc.) be a reliable indicator of model fit, since its upper bound depends on the overall probability of occurrence of the event itself? Would a low R2 still represent a poor model? I’m assuming the confusion matrix may no longer be a great indicator of model accuracy either…

Thanks

McFadden’s R2 is probably more useful in such situations than the Cox-Snell R2. But I doubt that either is very informative. I certainly wouldn’t reject a model in such situations just because the R2 is low.
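The upper bound mentioned in the question can be computed directly. The sketch below uses the standard definitions: Cox-Snell R2 is 1 - exp(2(lnL0 - lnL1)/n), and McFadden's R2 is 1 - lnL1/lnL0, where lnL0 is the log-likelihood of the intercept-only model. The Cox-Snell maximum is reached when lnL1 = 0, so it depends only on the proportion of events. The function names are illustrative.

```python
import math


def cox_snell_upper_bound(p):
    """Largest attainable Cox-Snell R2 when the sample proportion of events is p."""
    # Log-likelihood per case of the intercept-only model
    null_loglik_per_case = p * math.log(p) + (1 - p) * math.log(1 - p)
    return 1 - math.exp(2 * null_loglik_per_case)


def mcfadden_r2(loglik_model, loglik_null):
    """McFadden's R2, which can approach 1 whatever the proportion of events."""
    return 1 - loglik_model / loglik_null


for p in (0.5, 0.1, 0.01):
    print(f"event proportion {p}: maximum Cox-Snell R2 = {cox_snell_upper_bound(p):.3f}")
```

With a 50-50 split the ceiling is .75, but with 1% events it is only about .11, so even a perfectly predicting model would have a Cox-Snell R2 that looks low.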

Dear Dr. Allison,

I am analyzing the binary decisions of 500,000 individuals across two periods (so one million observations total). There were 2,500 successes in the first period, and 6,000 in the second. I estimate the effects of 20 predictors per period (40 total). For some reason, both logit and probit models give me null effects to variables that are significant under a linear probability model.

Any thoughts on why this might be the case? Thanks very much.

Good question. Maybe the LPM is reporting inaccurate standard errors. Try estimating the LPM with robust standard errors.

Thanks so much for the suggestion. I did use robust standard errors (the LPM requires it as it fails the homoskedasticity assumption by construction), and the variables are still significant under the LPM. I recall reading somewhere that the LPM and logit/probit may give different estimates when modeling rare events, but cannot find a reference supporting this or intuit myself why this might be the case.

It does seem plausible that results from LPM and logit would be most divergent when the overall proportion of cases is near 1 or 0, because that’s where there should be most discrepancy between a straight line and the logistic curve. I have another suggestion: check for multicollinearity in the variable(s) that are switching significance. Seemingly minor changes in specification can have major consequences when there is near-extreme collinearity.

Thanks, and so sorry for the late reply. I think you are right that collinearity may be responsible for the difference. In my analysis, I aim to find the incremental effect of several variables in the latter period (post-treatment) above and beyond effects in the earlier period (pre-treatment). Every variable thus enters my model twice, once alone and once interacted with a period indicator. The variables are, of course, very correlated with themselves interacted with the indicator. Thanks again!

Dr. Allison,

I appreciate your comments on this topic. I would like to know whether there are any articles about the influence of the events of independent variables. Thanks a lot.

Sorry, but I don’t know what you mean by “events of independent variables.”

Paul, I saw your post while searching for more information related to rare events logistic regressions. Thank you for the explanation, but why not ZIP (zero-inflated Poisson) regression?

Dear Dr Allison,

Is there a threshold that one should adhere to for an independent variable to be used in LR, in terms of the ratio of the two categories within a categorical independent variable? For example, suppose I am trying to assess whether gender is a predictor of getting an infection (coded as 1) in a sample of 100 subjects, but 98 subjects are male and only 2 are female. Will the results be reliable given such disparity between the two categories of the independent variable? [The events-per-variable ratio is set flexibly at 5.]

thank you for your advice.

regards

John

With only 2 females, you will certainly not be able to get reliable estimates of sex differences. That should be reflected in the standard error for your sex coefficient.

Dear Dr. Allison,

I have a slightly different problem, but maybe you have an idea. I am using a multinomial logit model. One value of the dependent variable has 100 events, the other 4000 events. The sample size is 1,900,000. I am thinking the 100 events could be too few.

Thank you!

100 might be OK if you don’t have a large number of predictors. But don’t make this category the reference category.

Thank you,

I am using about ten predictors; would you consider this a low number in this case?

In general, is there an easy-to-implement way to deal with rare events in a multinomial logit model?

Should be OK.

Dear Dr. Allison,

I have a population of 810,000 cases with 500 events. I would like to use a logit model with about 10 predictors. If I ran a logistic regression, would it give good estimates of the coefficients (especially the constant term)?

Thank you!

I see no problem with this. You can judge the quality of the constant term estimate by its confidence interval.

I don’t understand, because I read in the article https://files.nyu.edu/mrg217/public/binaryresponse.pdf (page 38, discussing King and Zeng’s article) that “logit coefficients can lead us to underestimate the probability of an event even with sample sizes in the thousands when we have rare events data”. In fact, they explain that the constant term is affected (largely negative), but I think they also talk about biased coefficients (page 42).

Also, we can read a lot about prior correction for rare events in samples. I am wondering what the point of this correction is. Why should we use a sample rather than the whole available population if the estimates are biased in both cases?

As I said in my post, what matters for bias is not the rarity of events (in terms of a small proportion) but the number of events that are actually observed. If there is concern about bias, the Firth correction is very useful and readily available. I do not believe that undersampling the non-events is helpful in this situation.

Dr. Allison–

Thank you very much for this helpful post. I am analyzing survey data using SAS. I am looking at sexual violence, and there are only 144 events. Although the overall sample is quite large (over 18,000), due to skip patterns in the survey, I am looking at a subpopulation of only sexually active males (the only ones in the survey asked the questions of interest). The standard errors for the overall sample look excellent, but when applying subpopulation analysis the standard errors are large. Do you have any suggestions to address this? I believe that I can’t use the Firth method in this case because I use SAS and it doesn’t seem to be available for PROC SURVEYLOGISTIC.

Thank you.

–Karen

How many events in your subpopulation? There may not be much you can do about this.

According to the Stata manual entry on the complementary log-log model, “Typically, this model is used when the positive (or negative) outcome is rare”, but there isn’t much explanation provided.

I tried looking up a few papers and textbooks about clog-log but most simply talk about the asymmetry property.

Can we use clog-log for rare event binary outcome? Which is preferred?

I’m not aware of any good reason to prefer complementary log-log over logit in rare event situations.

Dear Dr. Allison,

I have a sample with 5 events out of a total sample of 1500. Is it possible to perform logistic regression with this sample (I have 5 predictors)? Do you know if the Firth method is available in SPSS?

Thank you.

Not much you can do with just five events. Even a single predictor could be problematic. I’d go with exact logistic regression, not Firth. As far as I know, Firth is not available in SPSS.

(Correction2 – I sincerely apologize for my errors – the following is a correct and complete version of my question)

Dr. Allison,

I have a sample of 7108 with 96 events. I would like to utilize logistic regression and seem to be OK with standard errors. However, when analyzing standardized residuals for outliers, all but 5 of the 96 cases positive for the event have a SD>1.96. I have a few questions:

1) Is 96 events sufficient for logistic regression?

2) With 96 events, how many predictors would you recommend?

3) In that rare events analysis is really analysis of outliers, how do you deal with identifying outliers in such a case?

Thank you.

1. Yes, 96 events is sufficient.

2. I’d recommend no more than 10 predictors.

3. I don’t think standardized residuals are very informative in a case like this.

I have a data set of about 60,000 observations with 750 event cases. I have 5 predictor variables. When I run the logistic regression, all the predictors come out significant. The concordant pairs are about 80%. However, the overall model fit is not significant. Any suggestions to deal with this?

It’s rather surprising that all 5 predictors would be significant (at what level?) but the overall model fit is not significant. Send me your output.

Hi Dr. Allison,

You have mentioned that 2000 events out of 100,000 is a good sample for logistic regression, which is a 98%-2% split. I have always been advised that we should have an 80-20 or 70-30 split for logistic regression, and that if such a split is not there, then we should reduce the data. For example, we should keep the 2000 events, randomly select 8000 non-event observations, and run the model on 10,000 records in place of 100,000. Please advise.

There is absolutely no requirement that there be an 80-20 split or better. And deleting cases to achieve that split is a waste of data.

Dear Dr. Allison,

I have data of 41 patients with 6 events (=death). I am studying the prognostic value of a diagnostic parameter (DP) (numerical) for outcome (survival/death).

In a logistic regression of outcome versus DP, DP was significant. However, I would like to clarify whether this prognostic value is independent of age and 3 other dichotomous parameters (gender, disease, surgery). In a multiple logistic regression, DP was the only significant parameter out of these 5. But I was told the ratio of events to number of parameters should be at least 5, so this result has no meaning. Is there any method that could help me come closer to an answer? Or is it simply not enough data (unfortunately, small populations are a common problem in clinical studies)? Thank you very much for any suggestion.

Bernhard

Try exact logistic regression, available in SAS, Stata, and some other packages. This is a conservative method, but it has no lower bound on the number of events. You may not have enough data to get reliable results, however.

I have a rare predictor (n=41) and a rare outcome. Any guidelines on how many events are needed for the predictor? (Or, the n in a given chi-square cell?)

Thanks so much!

Check the 2 x 2 table and compute expected frequencies under the independence hypothesis. If they are all > 5 (the traditional rule of thumb) you should be fine.
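The check described in this reply is easy to script. In the sketch below, the expected count for each cell under independence is (row total x column total) / N. The table itself is hypothetical, with 41 cases of the rare predictor as in the question; the other counts are invented.

```python
def expected_counts(table):
    """Expected cell counts under independence for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical counts. Rows: predictor absent / present. Columns: no event / event.
table = [[9_000, 150],
         [35, 6]]
expected = expected_counts(table)
all_above_five = all(e > 5 for row in expected for e in row)

for row in expected:
    print([round(e, 2) for e in row])
print("all expected counts > 5:", all_above_five)
```

In this made-up table the expected count in the predictor-present, event cell is below 1, so the rule of thumb is not met even though 6 such cases were observed.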

Dear Colleagues, sorry to interrupt your discussion, but I need some help from experts.

I am a young cardiologist, and I am studying the outcome in patients with coronary ectasia during acute myocardial infarction (a very rare condition). I have only 31 events (a combined outcome of death, revascularization and myocardial infarction). After univariate analysis I selected 5 variables. Is it possible, in your opinion, to carry out a Cox regression analysis in this case? The EPV is only 31/5 = 6.2.

Thanks

It’s probably worth doing, but you need to be very cautious about statistical inference. Your p-values (and confidence intervals) are likely to be only rough approximations. A more conservative approach would be to do exact logistic regression.

Hi Dr. Allison,

I am working on a rare event model with response rates of only 0.13% (300 events in a data sample of 200,000). I was reading through your comments above, and you have stressed that what matters is the number of the rarer event, not the proportion. Can we specify a “minimum number of events” that the data must have for modeling?

In my case, I am asking this as I do have an option of adding more data to increase the number of events(however the response rate will remain the same 0.13%). How many events will be sufficient?

Also, what would be the best strategy here: stratified sampling or the Firth method?

Thanks,

Saurabh

A frequently mentioned but very rough rule of thumb is that you should have at least 10 events for each parameter estimated. The Firth method is usually good. Stratified sampling (taking all events and a simple random sample of the non-events) is good for reducing computation time when you have an extremely large data set. In that method, you want as many non-events as you can manage.

Hi Paul! I’ve been reading this thread, and I have also encountered problems in modeling outcomes for rare events occurring at 10% in the population we’re studying. One thing we did to capture the distinctive behavior was to draw equal samples of outcomes and non-outcomes, just to determine the behavior that predicts such outcomes. But when we ran the logistic model, we did not apply any weights to make the results representative of the population. Is this OK? I am really not happy with the accuracy of the model: only 50% of those predicted to have the outcome actually had it. Is our problem just a function of the equal sampling proportion? And will the Firth method help to improve our model? Hope to get good insights and recommendations from you. Thanks!

Unless you’re working with very large data sets where computing time is an issue, there’s usually nothing to be gained by sampling to get equal fractions of events and non-events. And weighting such samples to match the population usually makes things worse by increasing the standard errors. As I tried to emphasize in the blog, what’s important is the NUMBER of rare events, not the fraction of rare events. If the number of rare events is substantial (relative to the number of predictors), the Firth method probably won’t help much.

Hi, thank you so much for your response. We’re working indeed with very large data. We need to sample to make computing time more efficient. I understand that what matters are the number of rare events and not the fraction, that’s why we made sure that we have a readable sample of the events. But I feel that the problem of accuracy of predicting the event is because of the equal number of events and non events used in the model. Is this valid? And yes, applying weights did no good. It made the model results even worse. For the model build for my base, should I just use random sampling of my entire population and just make sure that I have a readable base of my events?

When sampling rare events from a large data base, you get the best estimates by taking all of the events and a random sample of the non-events. The number of non-events should be at least equal to the number of events, but the more non-events you can afford to include, the better. When generating predicted probabilities, however, you should adjust for the disproportionate sampling. In SAS, this can be done using the PRIOREVENT option on the SCORE statement.
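The adjustment for disproportionate sampling can be sketched as the usual prior correction: rescale the predicted odds by the ratio of the population odds of an event to the sample odds, which is equivalent to shifting the intercept. This is a sketch of that standard correction, not a reproduction of what SAS does internally, and the function name and the example figures (a 50-50 sample drawn from a population with 10% events, as in the question above) are assumptions for illustration.

```python
def correct_probability(p_sample, sample_event_fraction, population_event_fraction):
    """Rescale a predicted probability from an outcome-stratified sample to the population."""
    sample_odds = sample_event_fraction / (1 - sample_event_fraction)
    population_odds = population_event_fraction / (1 - population_event_fraction)
    # Same as adding log(population_odds / sample_odds) to the intercept
    odds = p_sample / (1 - p_sample) * population_odds / sample_odds
    return odds / (1 + odds)


# A model fitted to a 50-50 sample predicts 0.5 for some case;
# in a population where 10% have the event, that corresponds to 0.1.
for p in (0.2, 0.5, 0.8):
    print(p, round(correct_probability(p, 0.5, 0.10), 3))
```

Without this step, predicted probabilities from an equal-fractions sample overstate the chance of the event, which affects any classification cutoff applied to them.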

Since it sounds like the bias relates to maximum likelihood estimation, would Bayesian MCMC estimation methods also be biased?

Good question, but I do not know the answer.

Is this a relevant article?

Mehta, Cyrus R., Nitin R. Patel, and Pralay Senchaudhuri. “Efficient Monte Carlo methods for conditional logistic regression.” Journal of The American Statistical Association 95, no. 449 (2000): 99-108.

This article is about computational methods for doing conditional logistic regression. It’s not really about rare events.

Can you use model fit statistics from SAS such as the AIC and -2 log likelihood to compare models when penalized likelihood estimation with the Firth method is used?

I believe that the answer is yes, although I haven’t seen any literature that specifically addresses this issue.

I know you’ve answered this many times above regarding logistic regression and discrete-time models — that if you have a huge number of observations, then it is best to take all of the events and a simple random sample of all of the non-events which is at least as large as the number of events. My question is: Does this advice apply also to continuous time models, specifically the Cox PH with time-varying covariates? I ask because I have a dataset with 2.8 million observations, 3,000 of which are events. Due to the many time-varying covariates and other fixed covariates (about 10 of each), we had to split the data into counting process format, so the 3,000 events have become 50,000 rows. Thus, our computing capabilities are such that taking a simple random sample from the non-events that is 15,000 (which become about 250,000 rows) and running these in PHREG with the events takes considerable computing time (it uses a combination of counting process format AND programming statements). Long story short, the question is – is 15,000 enough? And what corrections need to be made to the results when the model is based on a SRS of the non-events?

I think 15,000 is enough, but the methodology is more complex with Cox PH. There are two approaches: the nested case-control method and the case-cohort method. The nested case-control method requires a fairly complicated sampling design, but the analysis is (relatively) straightforward. Sampling is relatively easy with the case-cohort method, but the analysis is considerably more complex.

Thank you so much for the quick response! I really appreciate the guidance. I’ve just been doing some reading about both of these methods, and your concise summary of the advantages and disadvantages of each approach is absolutely right on. I wanted to share, in case others are interested, two good and easy-to-understand articles on these sampling methodologies which I found: “Comparison of nested case-control and survival analysis methodologies for analysis of time-dependent exposure”, Vidal Essebag, et al. and “Analysis of Case-Cohort Designs”, William E. Barlow, et al.

“Does anyone have a counter-argument?”

In the 2008 paper “A weakly informative default prior distribution for logistic and other regression models” by Gelman, Jakulin, Pittau and Su, a different, fully Bayesian approach is proposed:

– shifting and scaling non-binary variables to have mean 0 and std dev 0.5

– placing a Cauchy-distribution with center 0 and scale 2.5 on the coefficients.

Cross-validation on a corpus of 45 data sets showed superior performance. Surprisingly the Jeffreys’ prior, i.e. Firth method, performed poorly in the cross-validation. The second-order unbiasedness property of Jeffreys’ prior, while theoretically defensible, doesn’t make use of valuable prior information, notably that changes on the logistic scale are unlikely to be more than 5.

This paper focused on solving the common problem of infinite ML estimates when there is complete separation, not so much on rare events per se. The corpus of 45 data sets consists mostly of reasonably balanced data sets with Pr(y=1) between 0.13 and 0.79.

Yet the poor performance of the Jeffreys’ prior in the cross-validation is striking. Its mean logarithmic score is actually far worse than that of conventional MLE (using glm).
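The two ingredients listed in this comment (rescaling the inputs and a Cauchy prior on the coefficients) can be sketched in a few lines of Python. This is only an illustration of the description above, not the paper's own code. The function names are invented here, and the use of the sample standard deviation is an assumption.

```python
import math
import statistics


def rescale(values, target_sd=0.5):
    """Shift and scale a non-binary variable to mean 0 and SD 0.5."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1); an assumption here
    if sd == 0:
        raise ValueError("cannot rescale a constant variable")
    return [(v - mean) * target_sd / sd for v in values]


def cauchy_log_density(beta, center=0.0, scale=2.5):
    """Log density of a Cauchy(center, scale) prior for one coefficient."""
    z = (beta - center) / scale
    return -math.log(math.pi * scale * (1.0 + z * z))
```

Adding `cauchy_log_density` for each coefficient to the log-likelihood would give a penalized log-likelihood whose maximum is the posterior mode under this prior.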

I am in political science and wanted to use rare events logit in Stata, but it does not allow me to use fixed or random effects. After reading your work, I am not even sure my events are rare. Could you please let me know if I have a problem and how I might resolve it in Stata?

I have one sample with 7851 observations and 576 events. I have another sample with 6887 observations and 204 events.

I appreciate your advice.

Katherine

I don’t see any need to use rare event methods for these data.

I should have mentioned that I have 8 independent variables in my models.

Hi Dr. Allison,

Thanks for this post. I have been learning how to use logistic regression and your blog has been really helpful. I was wondering if we need to worry about the number of events in each category of a factor when using it as a predictor in the model. I’m asking this because I have a few factors set as independent variables in my model and some are highly unbalanced, which makes me worry that the number of events might be low in some of the categories (when size is low). For example, one variable has 4 categories and sizes range from 23 (15 events) to 61064! Total number of events is 45334 for a sample size of 83356. Thanks!

This is a legitimate concern. First of all, you wouldn’t want to use a category with a small number of cases as the reference category. Second, the standard errors of the coefficients for small categories will probably be high. These two considerations will apply to both linear and logistic regression. In addition, for logistic regression, the coefficients for small categories are more likely to suffer from small-sample bias. So if you’re really interested in those coefficients, you may want to consider the Firth method to reduce the bias.

Hi Dr. Allison,

When I have 20 events out of 1000 cases, can a resampling method like the bootstrap help to improve estimation? Thanks very much!

I strongly doubt it.

Dr. Allison, it is great to get your reply, thanks very much. Could you explain why the bootstrap can’t help when events are rare? Also, if I have 700 responders out of 60,000 cases and 15 variables in the final model, but 500 candidate variables in the original variable selection process, do you think the 700 events are enough? Thanks again!

What do you hope to accomplish by bootstrapping?

I want to increase the number of events by bootstrapping so that there are enough events for parameter estimation.

Bootstrapping can’t achieve that. What it MIGHT be able to do is provide a more realistic assessment of the sampling distribution of your estimates than the usual asymptotic normal distribution.
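To see why, here is a small Python sketch using the 20-events-in-1000 example from this exchange. Each bootstrap sample is drawn with replacement from the same 1000 cases, so the number of events just fluctuates around 20; no new information is created. The function name, the seed, and the number of replicates are arbitrary choices for illustration.

```python
import random


def bootstrap_event_counts(y, n_replicates=200, seed=12345):
    """Count the events in each bootstrap resample of a 0/1 outcome list."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n = len(y)
    # Each replicate resamples n cases with replacement from the original data.
    return [sum(rng.choices(y, k=n)) for _ in range(n_replicates)]


y = [1] * 20 + [0] * 980  # 20 events out of 1000 cases
counts = bootstrap_event_counts(y)
mean_count = sum(counts) / len(counts)
print(f"mean events per bootstrap sample: {mean_count:.1f}")
print(f"range across samples: {min(counts)} to {max(counts)}")
```

The spread of those counts (and of any estimate recomputed on each resample) is what the bootstrap is actually useful for: describing sampling variability.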

Hi Dr. Allison,

I am working on natural resource management issues. In my project, ‘yes’ responses on my dependent variable are 80-85% while ‘no’ responses are 14-18%. Can I use a binary logit model here?

with Regards

S. Ray

Probably, but as I said in my post, what matters more is the number of “no”s, not the percentage.

Hi Paul!

I would be most grateful if you could help me with the following questions: 1) I have a logistic regression model with supposedly low power (65 events and ten independent variables). Several variables do however come out significant. Are these significance tests unreliable in any way?

And 2) do you know if it is possible to perform the penalized likelihood in SPSS?

They could be unreliable. In this case, I would try exact logistic regression. I don’t know if penalized likelihood is available in SPSS.

Dear Dr. Allison,

I am trying to build a logistic regression model for a dataset with 1.4 million records, with the rare event comprising 50,000 records. The number of variables is about 50, most of which are categorical variables with about 4 classes each on average. I wanted to check with you whether it is advisable to use the Firth method in this case.

Thank You

You’re probably OK with conventional ML, but check to see how many events there are in each category of each variable. If any of the numbers are small, say, less than 20, you may want to use Firth. And there’s little harm in doing so.

This is a nice discussion, but penalization is a much more general method than just the Firth bias correction, which is not always successful in producing sensible results. There are real examples in which the Firth method could be judged inferior (on both statistical and contextual grounds) to stronger penalization based on conjugate-logistic (log-F) priors. These general methods are easily implemented in any logistic-regression package by translating the penalty into prior data. For examples see Sullivan & Greenland (2013, Bayesian regression in SAS software. International Journal of Epidemiology, 42, 308-317). These methods have a frequentist justification in terms of MSE reduction (shrinkage) so are not just for Bayesians; see the application to sparse data and comparison to Firth on p. 313.

Thanks for the suggestions.

Dr. Paul Allison, I am very thankful to you for your post and the discussion that followed, from which I have solved all of my problems except one. My event is out-migration, with 849 cases, which is 1.2% of the total sample (69,207). Regarding the small proportion, I think my data are in the comfort zone for applying logistic regression. But the dependent variable is highly skewed (8.86 skewness). Does this pose any problems, and if so, how can I take care of it? Reducing the number of non-events by taking a random sample has been found helpful, but I doubt whether it preserves the actual characteristics of the population concerned. Please clarify this for me. I use SPSS. Thanks.

The skewness is not a problem. And I see no advantage in reducing the number of non-events by taking a random sample.

Dear Dr. Paul Allison, I understand that we have to pay attention to small-sample bias for small categories. But I have continuous independent variables, and 50 events over 90,000 cases (all times 11 years). If I use, for example, 4 independent variables in a logit estimation, could I have problems in the interpretation of their estimated coefficients and their significance? Thanks

I’d probably want to go with the Firth method, using p-values based on profile likelihood. To get more confidence in the p-values, you might even want to try exact logistic regression. Although the overall sample size is pretty large for exact logistic, the small number of events may make it feasible.

Dear Dr. Paul Allison, I would like to know which kind of logistic regression analysis I should use if I have 1500 samples and only 30 positives. Should I use exact or Firth? What would be the advantage of using either of them in the analysis?

Firth has the advantage of reducing small sample bias in the parameter estimates. Exact is better at getting accurate p-values (although they tend to be conservative). In your case I would do both: Firth for the coefficients and exact for the p-values (and/or confidence limits).

I think this situation is most similar to my own but I’d like to check if possible. I have an experiment that has 1 independent variable with 3 levels, and a sample size of 30 in each condition. Condition 1 has 1 success/positive out of 30. Condition 2 has 4/30, and Condition 3 has 5/30. Can I rely on Firth or do I need both? (And is it acceptable to report coefficients from one but probability from another? I wouldn’t have guessed that would be ok.)

I don’t think you even need logistic regression here. You have a 3 x 2 table, and you can just do Fisher’s exact test, which is equivalent to doing exact logistic regression. I don’t think there’s much point in computing odds ratios, either, because they would have very wide confidence intervals. I’d just report the fraction of successes under each condition.

Dear Dr. Paul Allison,

In which case can we use 10% level of significance( p-value cut off point) instead of using 5%? For instance, if you have nine independent variables,and run univariate logistic regression, you find that the p-value for your three independent variables is below 10%. If you drop those variables which are above 10% (using 10% level of significance) and use firth to analyse your final model, you will end up with significant value(P<0.05) of the three variables. Is it possible to use this analysis and what would be the reason why you use 10% as cut off value?

I don’t quite understand the question. Personally, I would never use .10 as a criterion for statistical significance.

Dr. Allison, this is an excellent post with continued discussion. I am currently in debate with contractors who have ruled out 62 events in a sample of 1500 as too small to analyse empirically. Is 62 on the cusp of simple logistic regression or would the Firth method still be advisable? Further, is there a rule of thumb table available which describes minimum number of events necessary relative to sample and number of independent variables? Many thanks. Becky

It may not be too small. One very rough rule of thumb is that there should be at least 10 cases on the less frequent category for each coefficient in the regression model. A more liberal rule of thumb is at least 5 cases. I would try both Firth regression and exact logistic regression.
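The arithmetic behind these rules of thumb is simple enough to put in a few lines of Python. The function name and arguments below are invented for illustration, and the result is only as good as the rule of thumb itself.

```python
def max_coefficients(n_events, n_nonevents, cases_per_coefficient=10):
    """Rough upper limit on the number of coefficients under the rule of thumb.

    Uses the count in the less frequent outcome category, divided by the
    required number of cases per coefficient (10 by default, 5 for the
    more liberal version).
    """
    if cases_per_coefficient <= 0:
        raise ValueError("cases_per_coefficient must be positive")
    return min(n_events, n_nonevents) // cases_per_coefficient


# 62 events in a sample of 1500, as in the question above
print(max_coefficients(62, 1500 - 62))     # strict rule: 10 per coefficient
print(max_coefficients(62, 1500 - 62, 5))  # liberal rule: 5 per coefficient
```

With 62 events, that works out to about 6 coefficients under the stricter rule and about 12 under the more liberal one.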

For a rare event example, 20 events in 10,000 cases, may we add multiple copies of the events (like 19 times the events, so that we can get 200 events) to the data? Once we get the predicted probability, we just need to adjust the probability by the percentages (in this case 10/10000 -> 200/10200).

Or may we use a bootstrapping method to resample the data?

No, it’s definitely not appropriate to just duplicate your 20 events. And I can’t see any attraction to resampling. For p-values, I’d recommend exact logistic regression.

Exact logistic regression, rare events, and Firth method work well for binary outcomes. What would you suggest for rare continuous outcomes?

Say, I have multiple sources of income (20,000+ sources). Taken separately, each source throughout a year generates profit only on rare occasions. Each source could have 362 days of zero profit, and 3 days of positive profit. The number of profit days slightly vary from source to source.

I have collected daily profit values generated by each source into one data set. It looks like pooled cross sections. This profit is my dependent variable. Independent variables associated with it are also continuous variables.

Can you provide me any hints of which framework to use? (I tried tobit model that assumes left censoring.) Can I still use Firth or rare events?

Thanks.

Well, I’d probably treat this as a binary outcome rather than continuous: profit vs. no profit. Then, assuming that your predictors of interest are time-varying, I’d do conditional logistic regression, treating each income source as a different stratum. Although I can’t cite any theory, my intuition is that the rarity of the events would not be a serious problem in this situation.

Dr. Allison,

Hi. You may have already answered this in earlier threads, but is a sample size of 9000 with 85 events/occurrences considered a rare-event scenario? Is logistic regression appropriate?

Many thanks.

Rob

Yes, it’s a rare event scenario, but conventional logistic regression may still be OK. If the number of predictors is no more than 8, you should be fine. But probably a good idea to verify your results with exact logistic regression and/or the Firth method.

Sorry, follow-up question… what’s the minimum acceptable c-stat… I usually hear .7, so if I get, say 0.67, should I consider a different modeling technique?

The c-stat is the functional equivalent of an R-squared. There is no minimum acceptable value. It all depends on what you are trying to accomplish. A different modeling technique is not necessarily going to do any better. If you want a higher c-stat, try getting better predictor variables.
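For readers unsure what the c-stat measures, here is a minimal Python sketch of one common way to compute it: the proportion of event/non-event pairs in which the event has the higher predicted probability, counting ties as one half. The function names are made up for this example, and the double loop is only practical for small data sets.

```python
def c_statistic(y, p):
    """Concordance (c) statistic for 0/1 outcomes y and predicted probabilities p."""
    events = [pi for pi, yi in zip(p, y) if yi == 1]
    nonevents = [pi for pi, yi in zip(p, y) if yi == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                score += 1.0   # concordant pair
            elif pe == pn:
                score += 0.5   # tied pair counts as one half
    return score / (len(events) * len(nonevents))


def somers_d(y, p):
    """Somers' D for a binary outcome, which equals 2c - 1."""
    return 2.0 * c_statistic(y, p) - 1.0
```

A value of 0.5 means the model ranks events no better than chance. Because Somers' D equals 2c - 1, a Somers' D of 0.36 corresponds to a c-stat of 0.68.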

Hello Mr.Allison,

I’m writing to you because I have a similar problem. I have unbalanced panel data with 23 time periods (the attrition is due to loss of individuals over periods). I would like to ask your opinion on 2 issues:

1. How can I do the regression? Should I use the pooled data or panel data with FE/RE?

2. I also have a problem of rare events: for the pooled data I have almost 10000000 obs and only 45000 obs with the event=1 (0.45%). What do you think I should do in this case?

Thank you very much, I appreciate your help.

Stefan

1. I would do either fixed or random effects logistic regression.

2. With 45,000 events, you should be fine with conventional maximum likelihood methods.

First of all, thank you for your answers.

The problem is that when I do logistic regression for the pooled data I obtain a small Somers D (0.36) and my predicted probabilities are very small, even for the event=1 (the probabilities are no bigger than 0.003). I don’t know what to do.

What do you think is the problem, and what can I do?

Thank you again.

Hello,

Paul,

I am currently doing my project for MSc, I have a dataset with 2312 observation with only 29 observations. I want to perform logistic association. Which method would you recommend?

I assume you mean 29 events. I’d probably use the Firth method to get parameter estimates. But I’d double-check the p-values and confidence intervals with conditional logistic regression. And I’d keep the number of predictors low–no more than 5, preferably fewer.

Dear Dr. Allison,

I have a small dataset (90 with 23 events) and have performed an exact logistic regression which leads to significant results.

I wanted to add an analysis of the model fit and goodness-of-fit statistics, like AIC, the Hosmer-Lemeshow test or McFadden’s R2. After reading your book about logistic regression using SAS (second edition), my understanding is that all these calculations only make sense, or are only possible, if conventional logistic regression is used. Is my understanding correct? Are there other ways to check goodness of fit when using exact logistic regression? Thank you.

Standard measures of fit are not available for exact logistic regression. I am not aware of any other options.

Dr.Allison,

The article and comments here have been extremely helpful. I’m working on building a predictive model for bus breakdowns with logistic regression. I have 207960 records total with 1424 events in the data set. Based on your comments above, it seems I should have enough events to continue without oversampling. The only issue is that I’m also working with a large number of potential predictors, around 80, which relate to individual diagnostic codes that occur in the engine. I’m not suggesting that all of these variables will be in final model, but is there a limit to the number of predictors I should be looking to include in the final model? Also, some of predictors/diagnostic codes happen rarely as well. Is there any concern having rare predictors in a model with rare events?

Thanks,

Tony

Well, a common rule of thumb is that you should have at least 10 events for each coefficient being estimated. Even with 80 predictors, you easily meet that criterion. However, the rarity of the predictor events is also relevant here. The Firth method could be helpful in reducing any small-sample bias of the estimators. For the test statistics, consider each 2 x 2 table of predictor vs. response. If the expected frequency (under the null hypothesis of independence) is at least 5 in all cells, you should be in good shape.
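The expected-frequency check can be done by hand, but a short Python sketch may help. It computes the expected count in each cell under independence as (row total × column total) / grand total. The function names and the example table are hypothetical, not taken from the bus breakdown data.

```python
def expected_counts(table):
    """Expected cell counts under independence for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def min_expected(table):
    """Smallest expected count; the rule of thumb wants this to be at least 5."""
    return min(min(row) for row in expected_counts(table))


# Hypothetical predictor-by-response table:
# rows = diagnostic code present / absent, columns = breakdown / no breakdown
table = [[10, 90],
         [20, 880]]
print(expected_counts(table))
print(min_expected(table) >= 5)
```

In this made-up table the smallest expected count is 3, so that predictor would fall short of the guideline.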

Hello Dr. Allison,

The data I use are also characterized by very rare events (~0.5% positives). There are, however, enough positives (thousands), so it should hopefully be OK to employ logistic regression according to your guidelines.

My question comes from a somewhat different angle (which I hope is ok).

I have ~20 predictors which by themselves represent estimated probabilities. The issue is that the level of confidence in these probabilities/predictors may vary significantly. Given that these confidence levels could be estimated, I’m looking for a way to take these confidence levels into account as well, since the predictor’s true weight may significantly depend on its confidence.

One suggested option was to divide each predictor/feature into confidence based bins, so that for each case (example) only a single bin will get an actual (non zero) value. Similar to using “Dummy Variables” for category based predictors. Zero valued features seem to have no effect in the logistic regression formulas (I assume that features would need to be normalized to a 0 mean value)

Could this be a reasonable approach ?

Any other ideas (or alternative models) for incorporating the varying confidence levels of the given predictor values?

Thanks in advance for your time and attention

One alternative: if you can express your confidence in terms of a standard error or reliability, then you can adjust for the confidence by estimating a structural equation model (SEM). You would have to use a program like Mplus or the gsem command in Stata that allows SEM with logistic regression. BTW, if you do dummy variables, there is no need to normalize them to a zero mean.

Thank you so much for your response and advice.

Regarding the option of using dummy variables, here is what I find confusing:

– On the one hand, whenever a feature assumes a value of 0 its weight learning does not seem to be affected (according to the gradient descent formula), or maybe i’m missing something ..

– On the other hand, the features in my case represent probabilities (which are a sort of prediction of the target value). So if in a given example the feature assumes a 0 value (implying a prediction of 0) but the actual target value is 1 it should cause the feature weight to decrease (since, in this example, it’s as far as possible from the true value)

Another related question that I have:

In logistic regression the linear combination is supposed to represent the logit, i.e. the log-odds, log(p/(1-p)). In my case the features are themselves probabilities (actually sort of “predictions” of the target value). So their linear combination seems more appropriate for representing the probability of the target value itself rather than its logit. Since p is typically very small, ~0.5% (implying that log(p/(1-p)) ~= log(p)), would it be preferable to use the log of the features instead of the original feature values as input for the logistic regression model?

Again, thanks a lot for your advice.

Because there is an intercept (constant) in the model, a value of 0 on a feature is no different than any other value. You can add any constant to the feature, but that will not change the weight or the model’s predictions. It will change the intercept, however.

It’s possible that a log transformation of your feature may do better. Try it and see.
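The point about the intercept can be checked numerically. In the Python sketch below (with made-up coefficients and invented function names), shifting a feature by a constant c leaves the slope alone and moves the intercept by -slope × c, and the predicted probabilities come out identical.

```python
import math


def predict(intercept, slope, x):
    """Predicted probability from a one-feature logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def shift_feature(intercept, slope, c):
    """Coefficients for the same model re-expressed in terms of x + c."""
    # intercept + slope * x == (intercept - slope * c) + slope * (x + c)
    return intercept - slope * c, slope


b0, b1 = -4.0, 2.0  # hypothetical fitted values
c = 3.0             # constant added to the feature
b0_shifted, b1_shifted = shift_feature(b0, b1, c)

for x in [0.0, 0.5, 1.0]:
    original = predict(b0, b1, x)
    shifted = predict(b0_shifted, b1_shifted, x + c)
    print(f"x={x}: {original:.6f} vs {shifted:.6f}")
```

So a value of 0 on a feature is not special: it is just one point on a scale whose origin is absorbed by the intercept.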

I have a sample of 11,935 persons, of whom 944 made one or more visits to an emergency department during one year. Can I apply logistic regression safely to these data? (My colleague recommended a count data model like the ZINB model because conventional logistic regression generates a problem of underestimated ORs due to excess zeros. But I think the event itself can sometimes be more important information than the number of events per patient.)

Yes, I think it’s quite safe to apply logistic regression to these data. You could try the ZINB model, but see my blog post on this topic. A conventional NB model may do just fine. Personally, I would probably just stick to logistic, unless I was trying to develop a predictive model for the number of visits.

Dr. Allison,

I highly appreciate you for the valuable advice. But I have one more question.

He (my colleague) wrote to me:

“Our data have too many zeros of which some may be ‘good’ zeros but others may be ‘bad’ zeros. Then, we should consider that the results of logistic regression underestimate the probability of event (emergency department visit).”

If he is correct, what should I do to minimize this possibility? (Your words ‘quite safe’ in your reply imply that he is wrong, I guess)

If he is wrong, why is he wrong?

Thank you for sparing your time for me.

I would ask your colleague what he means by “too many zeros”. Both logistic regression and standard negative binomial regression can easily allow for large fractions of zeros. I would also want to know what is the difference between “good zeros” and “bad zeros”. Zero-inflated models are most useful when there is strong reason to believe that some of the individuals could not have experienced the event no matter what the values of their predictor variables. In the case of emergency room visits, however, it seems to me that everyone has some non-zero risk of such an event.

Dr. Allison,

Thank you very much. We bought some books on statistics, including your books. Your advice stimulated us to study important statistical techniques. Thank you.

Dear Dr. Allison,

I need your expertise on selecting appropriate method. I have 5 rare events(Machine failure) out of 2000 observations.

Now, I need to predict when machine will be down based on the historical data, I have 5 columns

1) Error logs – which were generated by the machine (non-numeric)

2) Time stamp – when error message was generated

3) Severity – Severity of each error log (1-low, 2- Medium, 3- High)

4) Run time – No. of hours the machine ran till failure

5) Failed? – Yes/No

Thanks in advance for your help!

With just five events, you’re going to have a hard time estimating a model with any reliability. Exact logistic regression is essential. Are the error logs produced at various times BEFORE the failure, or only at the time of the failure? If the latter, then they are useless in a predictive model. Since you’ve got run time, I would advise some kind of survival analysis, probably a discrete time method so that you can use exact logistic regression.

Thanks, Dr. Allison. The error logs were produced at various times BEFORE the failure. Is there a minimum required number of events (or proportion of events) for estimating a model? However, I would try other methods as you advised (Survival, Poisson model).

Well, if you had enough events, I’d advise doing a survival analysis with time dependent covariates. However, I really don’t think you have enough events to do anything useful. One rule of thumb is that you should have at least 5 events for each coefficient to be estimated.

Thank you so much for the post. I am working on data with only 0.45 percent “yes”s, and your posts were really helpful. The Firth method and the rare events logit produce the very same coefficients, as you explained in your post. The regular post-estimation commands such as mfx, however, do not give me the magnitudes of the effects that I would like to see after either method. I read all the posts in the blog, but could not find a clue.

Thank you for your help, Dr. Allison!

The mfx command in Stata has been superseded by the margins command. The firthlogit command is user written and thus may not support the post estimation use of the margins command. The problem with the exlogistic command is that it doesn’t estimate an intercept and thus cannot generate predicted values, at least not in the usual way.

Dear Dr. Allison,

I have 10 events in a sample with 46 observations (including the 10 events). I have run firthlogit in Stata, but I could not use the command fitstat to estimate r2. I would like to ask how I can estimate r2 with Stata? Is there any command?

Thanks in advance for your time and attention.

I recommend calculating Tjur’s R2 which is described in an earlier post. Here’s how to do it after firthlogit:

firthlogit y x1 x2

predict yhat

gen phat = 1/(1+exp(-yhat))

ttest phat, by(y)

The gen command converts log-odds predictions into probabilities. In the ttest output, what you’re looking for is the difference between the average predicted values. You’ll probably have to change the sign.
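For anyone not working in Stata, the same calculation can be sketched in Python. The function below assumes you already have the predicted log-odds for each case from whatever model you fit; the name `tjur_r2` and its arguments are made up for this example.

```python
import math


def tjur_r2(y, log_odds):
    """Tjur's R2: mean predicted probability among events minus the mean
    predicted probability among non-events.

    y: list of 0/1 outcomes.
    log_odds: predicted log-odds (linear predictor) for the same cases.
    """
    # Convert log-odds to probabilities, as the gen command does above.
    phat = [1.0 / (1.0 + math.exp(-z)) for z in log_odds]
    events = [p for p, yi in zip(phat, y) if yi == 1]
    nonevents = [p for p, yi in zip(phat, y) if yi == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    return sum(events) / len(events) - sum(nonevents) / len(nonevents)
```

Because the subtraction is done as events minus non-events, no sign change is needed here.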

Dear Allison,

I recently did a study of bleeding complications after a procedure. A total of 185 patients were enrolled in this study and 500 procedures were performed. Only 16 events were finally observed. What kind of method can I use to analyze the predictive factors of these events? I’ve tried logistic regression in SPSS; however, the reviewers said “The number of events is very low, which limits the robustness of the multivariable analysis with such a high number of variables.”

Thanks in advance for your help!

Do you really have 500 potential predictors? If so, you need to classify the procedures into a much smaller number. Then, here’s what I recommend: (1) Do forward inclusion stepwise logistic regression to reduce the predictors to no more than 3. Use a low p-value as your entry criterion, no more than .01. (2) Re-estimate the final model with Firth logit. (3) Verify the p-values with exact logistic regression.

Hi Paul,

In my case I have 14% (2.9 million) of the data with events. Is it fine if I go with MLE estimation?

Thanks!!!!

Yes

Dear Dr Allison,

I’m running some analysis about firms’ relations. I’ve got info on B-to-B relations (suppliers – customers) for almost all Belgian firms (let’s assume that I have all transactions – around 650,000 transactions after cleaning for missing values in explanatory variables) and I want to run a probit or a logit regression of the probability that two firms are connected (A supplies B) and I need to create the 0’s observations. What would be the optimal strategy, taking into account that I cannot create all potential transactions (19,249,758,792) ?

I’ve considered either selecting a random sample of suppliers (10% of the original sample) and a random sample of customers (same size) and considering all potential transactions between those two subsamples, or considering all actual transactions and randomly selected non-transactions.

I’d go with the 2nd method–all transactions and a random sample of non-transactions. But with network data, you also need special methods to get the standard errors right. There’s an R function called netlogit (in the sna package) that can do this.

Dear Dr. Allison,

I work in fundraising and have developed a logistic regression model to predict the likelihood of a constituent making a gift above a certain level. The first question my coworkers asked is what the time frame is for the predicted probability. In other words, if the model suggests John Smith has a 65% chance of making a gift, they want to know if that’s within the next 2 years, 5 years, or what. The predictor variables contain very little information about time, so I don’t think I have any basis to make this qualification.

The event we’re modeling is already pretty rare (~200 events at the highest gift level) so I’m concerned about dropping data, but the following approach has been suggested: If we want to say someone has a probability of giving within the next 3 years, we should rerun the model but restrict the data to events that happened within the last 3 years. Likewise, if we use events from only the last 2 years, then we’d be able to say someone has a probability of giving within the next 2 years.

Apart from losing data, I just don’t see the logic in this suggestion. Does this sound like a reasonable approach to you?

Any suggestions on other ways to handle the question of time would be much appreciated. It seems like what my coworkers want is a kind of survival analysis predicting the event of making a big gift, but I’ve never done that type of analysis, so that’s just a guess.

Thanks for your time,

DC

Ideally this would be a survival analysis using something like Cox regression. But the ad hoc suggestion is not unreasonable.