Regression methods are an integral component of any data analysis concerned with describing the relationship between a response variable and one or more explanatory variables.
In this post we will talk about binomial logistic regression and when it's used.
Logistic regression analysis allows us to predict whether a particular event will happen or not, and helps identify which variables may be useful in this task.
It is very common in many areas and situations, and is used when the outcome variable is binary or dichotomous. For instance, in healthcare the goal can be to predict whether a patient will have a certain disease or not; in marketing research, whether a person will buy a product; in HR, whether an employee will leave the company; or in education, whether a student will succeed or not.
Assumptions of Binomial Logistic Regression
For a logistic regression analysis, it is necessary to take the following into account:
- The dependent variable has only two possible outcomes.
- The results of the dependent variable must be independent.
- Include only relevant independent variables.
- More than 30 cases per independent variable.
- Logistic regression treats the independent variables as numeric, so dummy (indicator) variables can be used to replace categorical variables.
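As a minimal sketch of that last point, dummy coding can be done by hand (or by any statistics package); the variable name and its levels below are assumptions for illustration only:

```python
# Dummy (indicator) coding of a categorical predictor in plain Python.
# The variable "race" and its levels are hypothetical examples.

def dummy_code(values, levels=None):
    """Recode a categorical variable into 0/1 dummy columns,
    dropping the first level as the reference category."""
    if levels is None:
        levels = sorted(set(values))
    reference, rest = levels[0], levels[1:]
    # One column per non-reference level; a row is 1 only for its own level.
    return {f"is_{lvl}": [1 if v == lvl else 0 for v in values]
            for lvl in rest}

race = ["white", "black", "other", "white", "other"]
dummies = dummy_code(race, levels=["white", "black", "other"])
# "white" is the reference category, so it gets no column.
```

With k categories, k − 1 dummy columns are enough: the reference level is represented by all dummies being zero.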
The Binomial Logistic Regression Model
In logistic regression, the quantity being predicted is bounded between 0 and 1, which produces an S-shaped (sigmoid) curve fitted to the data.
The general equation for the logistic regression is:

ln(Odds) = b0 + b1X1 + b2X2 + … + bkXk

The terms on the right are the standard terms for the independent variables and the intercept in a regression equation. On the left-hand side, however, is the natural log of the odds, and the quantity "ln(Odds)" is called a logit. It can vary from minus infinity to plus infinity, thus removing the problem of predicting outside the bounds of the dependent variable. The odds are related to the probability (p) by:

Odds = p / (1 − p)
Note that there is a linear relationship with the independent variables in logistic regression, but it is linear in the natural log of the odds, not in the original probabilities. Since we are interested in the probability of an event, i.e., the category coded higher in the dichotomous variable, we can combine the two equations into an equation for the probability: p = 1 / (1 + e^−(b0 + b1X1 + … + bkXk)).
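This link between the logit and the probability can be sketched in a few lines of Python; the intercept and slope below are made-up values, not estimates from any dataset:

```python
import math

def probability(logit):
    """Invert ln(p / (1 - p)) = logit to recover the probability p."""
    return 1.0 / (1.0 + math.exp(-logit))

b0, b1 = -1.0, 0.5   # hypothetical intercept and slope
for x in (-4, 0, 2):
    p = probability(b0 + b1 * x)
    # p always stays inside (0, 1), even for extreme values of the logit
```

A logit of 0 corresponds to even odds, i.e. a probability of exactly 0.5; large positive or negative logits push p toward 1 or 0 without ever reaching them.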
Predicting Low Birth Weight: A Binomial Logistic Regression Example
Using a classical dataset, we will attempt to classify whether a baby will be born with low or higher weight based on some of the mother's characteristics. Specifically, we want to study the effect of smoking, race, hypertension, premature history, uterine irritability, among others, on the baby's weight classification according to the coding "greater than 2500g = 0; less than or equal to 2500g = 1".
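As an illustration of how such a model is estimated, the sketch below fits a one-predictor logistic regression by gradient ascent on the log-likelihood. The data are synthetic: a single made-up binary predictor ("smoker") generated so that smoking raises the chance of the outcome. It only mimics the smoking effect and is not the actual dataset:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
n = 200
smoker = [random.randint(0, 1) for _ in range(n)]
# Generating model (hypothetical): logit = -1.0 + 1.2 * smoker
y = [1 if random.random() < sigmoid(-1.0 + 1.2 * s) else 0 for s in smoker]

# Gradient ascent on the log-likelihood; the gradient per case is (y - p) * x.
b0, b1 = 0.0, 0.0
lr = 0.5
for _ in range(1500):
    g0 = g1 = 0.0
    for xi, yi in zip(smoker, y):
        err = yi - sigmoid(b0 + b1 * xi)
        g0 += err
        g1 += err * xi
    b0 += lr * g0 / n
    b1 += lr * g1 / n

# b1 comes out positive, mirroring the generating model; exp(b1) is the
# estimated odds ratio for smoking.
odds_ratio = math.exp(b1)
```

Real analyses would of course use a statistics package rather than hand-rolled gradient ascent, but the fitted coefficients have exactly the log-odds interpretation discussed next.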
The table below lists the coefficient information for each independent variable plus the constant.
The logistic model is expressed in terms of the log of the odds, so B is the effect of a one-unit change in an independent variable on the log odds. Take, for example, the mother's history of hypertension (Wald statistic = 6.431): the estimated effect is to increase the log odds by 1.746, and since this is a dummy variable we can simply state that hypertension in the mother increases the log odds by 1.746. But how does this impact the probabilities? It is better to interpret the Exp(B) column, which gives the odds ratio: if the mother has hypertension, we estimate that the odds of her having a low birth weight baby increase by a factor of 5.731.
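As a quick check, Exp(B) is simply e raised to the coefficient, which converts the additive effect on the log odds into a multiplicative effect on the odds:

```python
import math

# Exp(B) for the hypertension coefficient reported above.
b_hypertension = 1.746
odds_ratio = math.exp(b_hypertension)   # matches the Exp(B) column, ~5.73
```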
Looking at the significance values, we can also see that the mother's weight at the last menstrual period, smoking, and history of drug and alcohol abuse are significant predictors of a low birth weight baby (p-value < 0.05).
The variable importance chart shows which independent variables have a significant effect on the weight of a newborn baby. Aligned with the previous results, we can see that hypertension history, alcohol abuse, smoking, and drug addiction are among the most important variables.
Logistic regression also provides two measures that are analogous to R² in OLS regression. Because of the relationship between the mean and standard deviation of a dichotomous variable, the amount of variance explained by the model must be defined differently. The Cox and Snell (r²ML) pseudo R² is 0.154 and the Nagelkerke (r²CU) pseudo R² is 0.217. The Nagelkerke pseudo R² is usually preferred because, unlike the Cox and Snell R², it can reach a maximum value of one. By either measure, the independent variables explain only a modest amount of the variance.
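Both pseudo-R² measures are computed from the log-likelihoods of the null model (LL0) and the fitted model (LL1). As a sketch, the log-likelihood values below are assumptions chosen so the formulas approximately reproduce the figures quoted above; they are not taken from the actual output:

```python
import math

def cox_snell(ll0, ll1, n):
    """Cox & Snell pseudo R^2: 1 - (L0 / L1)^(2/n)."""
    return 1.0 - math.exp(2.0 * (ll0 - ll1) / n)

def nagelkerke(ll0, ll1, n):
    """Nagelkerke pseudo R^2: rescales Cox & Snell so its maximum is 1."""
    return cox_snell(ll0, ll1, n) / (1.0 - math.exp(2.0 * ll0 / n))

# Hypothetical log-likelihoods for a sample of 189 cases.
ll0, ll1, n = -117.3, -101.5, 189
r2_cs = cox_snell(ll0, ll1, n)   # ~0.154
r2_n = nagelkerke(ll0, ll1, n)   # ~0.217
```

The Nagelkerke value is always at least as large as the Cox and Snell value, since the rescaling divides by a quantity below one.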
This table shows the goodness of fit of our model. The Chi-square value of 8.303 (p-value = 0.404 > 0.05) indicates that the model has a good fit: the values estimated by the model are close to the observed values.
Other information we can obtain from the binomial logistic regression analysis is how well our model classifies our dependent variable.
The overall accuracy is 74%. However, from this table we can also see that we do much better for babies of higher birth weight: the model correctly predicted 119/130, or 91.5%, of these cases. It does a relatively poor job of predicting low birth weight babies, getting only 21/59, or 35.6%, correct. In a clinical context the interest would be in the low weight babies, so the current model would certainly not be acceptable.
The sensitivity (also called the true positive rate) is the proportion of low weight babies the model correctly identifies; a highly sensitive test is one that catches most of the low weight babies. The specificity (also called the true negative rate) refers to how well the model identifies the babies with higher birth weight. In this model, sensitivity is low (35.6%) while specificity is high (91.5%); of the babies predicted to be low weight, 65.6% actually were (the positive predictive value), and of those predicted to be higher weight, 75.8% actually were (the negative predictive value).
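These figures can be recomputed from the confusion matrix implied by the classification table, treating low birth weight as the positive class:

```python
# Confusion-matrix cells implied by the classification table above.
tp, fn = 21, 38    # actual low weight: predicted low / predicted higher
tn, fp = 119, 11   # actual higher weight: predicted higher / predicted low

accuracy    = (tp + tn) / (tp + fn + tn + fp)   # overall, ~74%
sensitivity = tp / (tp + fn)                    # true positive rate, ~35.6%
specificity = tn / (tn + fp)                    # true negative rate, ~91.5%
ppv         = tp / (tp + fp)                    # positive predictive value, ~65.6%
npv         = tn / (tn + fn)                    # negative predictive value, ~75.8%
```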
This illustrates the lack of correspondence between statistical fit of the model from likelihood statistics, or the significance of individual variables, and the predictive ability of the model. Finding a significant model does not mean having high predictability.
After running the model, it is wise to check how well it fits the data; this diagnostic can also be approached by looking at the residuals. As in any regression model, it is important to look for unusual observations or odd patterns in the data. Understanding the deviance of the data points helps us understand the model results.
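As a sketch of what such a residual check looks like, the deviance residual of each case can be computed from its observed outcome and fitted probability; the (y, p) pairs below are hypothetical:

```python
import math

def deviance_residual(y, p):
    """Signed square root of a case's contribution to the model deviance.
    Large absolute values flag observations the model explains poorly."""
    contrib = -2.0 * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return math.copysign(math.sqrt(contrib), y - p)

# Hypothetical (observed outcome, fitted probability) pairs.
cases = [(1, 0.80), (0, 0.10), (1, 0.05)]
residuals = [deviance_residual(y, p) for y, p in cases]
# The third case (y = 1 but fitted p = 0.05) stands out with a large residual.
```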
Note: All the analysis was performed using Quark Analytics Portal (check more @ www.quarkanalytics.com).