Logistic Regression in the Real World
Let’s get real here. We’ve all come across logistic regression at one point or another. Maybe we saw it while studying or at work, or maybe someone mentioned it in passing and it caught our attention. In the world of data science, logistic regression seems to be everywhere and everyone uses it, but what exactly is it used for?
Let’s get started by setting the logistic regression stage before moving on to the showcase.
What is Logistic Regression?
You probably already know that binary logistic regression models a dichotomous dependent variable. Either the result is or it isn’t. Either the email is spam or it isn’t. Either the patient has cancer or doesn’t. There is no halfway: the outcome itself can’t be an email that’s almost spam, or a patient that has 50% cancer.
This is what we all know as binary (or binomial) logistic regression. The question you ask yourself can be answered only in a yes or no fashion.
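To make the yes-or-no idea concrete, here is a minimal sketch of how a binary logistic model turns a score into a probability and then into a decision. The feature (number of links in an email), the intercept, the weight and the 0.5 cut-off are all hypothetical values chosen for illustration, not fitted to any data.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_spam(num_links: float,
                 intercept: float = -2.0,   # hypothetical, not fitted
                 weight: float = 0.8,       # hypothetical, not fitted
                 threshold: float = 0.5):
    """Return (probability, label) for a made-up one-feature spam model."""
    score = intercept + weight * num_links   # linear part of the model
    probability = sigmoid(score)             # logistic (sigmoid) link
    return probability, probability >= threshold

for links in (0, 2, 5):
    p, is_spam = predict_spam(links)
    print(f"{links} links -> P(spam) = {p:.2f}, classified as spam: {is_spam}")
```

The model works with a probability internally, and the binary answer only appears once a cut-off is applied to it.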
On top of that, there are also multinomial and ordinal logistic regressions.
Simply put, multinomial regression has a dependent variable which has more than two outcomes, unordered and with no quantitative importance. So, for example, classifying meals into “Vegetarian”, “Non-Vegetarian” and “Vegan”. No one choice is quantitatively more important than the other, but it’s not a simple binary output.
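As a rough sketch of the multinomial case, one common formulation gives each category its own score and converts the scores into probabilities with a softmax. The scores below are hypothetical placeholders, not the output of a fitted model.

```python
import math

def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Vegetarian", "Non-Vegetarian", "Vegan"]
scores = [1.2, 0.3, -0.5]  # hypothetical linear scores, not fitted values

probs = softmax(scores)
predicted = labels[probs.index(max(probs))]  # pick the most probable class

for label, p in zip(labels, probs):
    print(f"{label}: {p:.2f}")
print("Predicted:", predicted)
```

No category is treated as "higher" than another here; the model simply picks whichever one has the largest probability.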
Ordinal regression, on the other hand, does take ordering into account, while still having more than two possible outcomes. A good example of this is a Likert scale, where survey respondents rate a particular service as “Bad”, “Neutral” or “Good”. Here the order matters: “Good” ranks above “Neutral”, which ranks above “Bad”.
With that stats refresher done, we now arrive at the crucial question: “Yeah, ok. But how does this apply to the real world? It’s one thing to see logistic regression in a textbook, but how can I solve problems using this?”
Logistic Regression in the Real World
Let’s take a look at how different businesses have used logistic regression in order to classify, identify or solve any one of their problems.
First of all, let’s look at examples from the medical field, specifically biostatistics.
A sleep clinic in Toronto conducted a study based on the following question: “Is there a correlation between sleep apnoea and blood hypertension?”
As you can guess, their idea was to use logistic regression in order to predict the probability of a patient developing hypertension based on whether or not they suffered from sleep apnoea.
They surveyed approximately 2,700 adults, and after running their tests, found the following:
“Blood pressure and number of patients with hypertension increased linearly with severity of sleep apnoea, as shown by the apnoea-hypopnoea index. Multiple regression analysis of blood pressure levels of all patients not taking antihypertensives showed that apnoea was a significant predictor of both systolic and diastolic blood pressure after adjustment for age, body mass index, and sex. Multiple logistic regression showed that each additional apnoeic event per hour of sleep increased the odds of hypertension by about 1%, whereas each 10% decrease in nocturnal oxygen saturation increased the odds by 13%.”
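The percentages in that quote are odds ratios, and odds ratios multiply rather than add. A small sketch of the arithmetic follows, using only the two figures quoted above (about 1% per apnoeic event per hour, 13% per 10% drop in oxygen saturation). The choice of 20 extra events and a 20% drop is an arbitrary illustration, not something reported in the study.

```python
import math

def odds_multiplier(odds_ratio_per_unit: float, units: float) -> float:
    """Odds ratios compound: k extra units scale the odds by OR ** k."""
    return odds_ratio_per_unit ** units

or_per_event = 1.01       # "about 1%" per additional apnoeic event per hour
or_per_10pct_drop = 1.13  # "13%" per 10% decrease in nocturnal oxygen saturation

# The logistic regression coefficient implied by an odds ratio is its natural log.
coef_per_event = math.log(or_per_event)

# Illustrative what-ifs (the 20 events and the 20% drop are arbitrary choices).
events_20 = odds_multiplier(or_per_event, 20)
drop_20pct = odds_multiplier(or_per_10pct_drop, 2)

print(f"Implied coefficient per event: {coef_per_event:.4f}")
print(f"Odds multiplier for 20 extra events per hour: {events_20:.2f}")
print(f"Odds multiplier for a 20% saturation drop: {drop_20pct:.2f}")
```

Note that these are multipliers on the odds of hypertension, not on the probability itself.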
Here’s another case study:
A medical team in the Netherlands wanted to predict the probability of a person suffering from Crohn’s disease based on whether or not certain bacteria react to sera from patients with certain diseases. To do that, they tested not only patients who suffered from Crohn’s, but also patients with other ailments.
What they found out was the following:
“With the methods and interpretation described, 52% of the patients with Crohn’s disease were recognized as ‘definite’ or ‘probable’ Crohn’s disease and 14% as ‘suspected’. Only 1% of the healthy subjects were classified as ‘suspected’ and none as ‘definite’ or ‘probable’ Crohn’s disease.”
In other words, their model recognized over half of the Crohn’s patients in the sample as “definite” or “probable” cases, without placing a single healthy person in either of those categories. All of this was based on the reaction between certain bacteria and a patient’s serum.
Now, let’s move on to another field.
Say for example you’re an aspiring restaurant owner. You are concerned with two things: staying in business and keeping your place full.
As it turns out, there are some data scientists who devoted their efforts to answering those two questions. Here’s the first one:
A pair of students at the University of Massachusetts decided to tackle the following question: “What is the probability that a restaurant will go bankrupt and what are the key factors that come into play?”
They surveyed a total of 32 hospitality firms, and came to the following conclusion:
“The logit models, resulting from forward stepwise selection procedures, could correctly predict 91% and 84% of bankruptcy cases 1 and 2 years earlier, respectively. The estimated models imply that a hospitality firm is more likely to go bankrupt if it has lower operating cash flows and higher total liabilities. The models suggest that a prudent sales growth strategy accompanied by tighter control of operating expenses and less debt financing can help enhance a firm’s ability to meet its financial obligations and thereby reduce bankruptcy risk.”
So, as a keen business owner, you now know that in order to avoid bankruptcy, you should maintain a healthy growth strategy, keep tighter control of your operating expenses and reduce your debt financing. But what about the flow of customers?
David Nadler analyzed data from the New York City Department of Health and Mental Hygiene to find out whether the type of restaurant has any effect on the grade the health department gives it. Here’s what he found:
“Citywide, all of the restaurant types except Italian had significant crude odd ratios for the prediction of the highest grade. All of the restaurant types except American-style restaurants showed significant odds ratios. Logistic regression further showed that Caribbean, Chinese, Italian, Japanese, Latin, Mexican and Pizzerias had lower odds of receiving the highest grade when using American-style restaurants as the reference.”
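For readers unfamiliar with the terms in that quote, a crude odds ratio compares the odds of an outcome in one group with the odds in a reference group, before adjusting for anything else. The sketch below shows the calculation with made-up counts; none of the numbers come from the study.

```python
def odds(top_grade: int, other_grade: int) -> float:
    """Odds of the top grade: restaurants that got it divided by those that did not."""
    return top_grade / other_grade

def crude_odds_ratio(group, reference) -> float:
    """Each argument is a (top_grade_count, other_grade_count) tuple."""
    return odds(*group) / odds(*reference)

# Made-up counts for illustration only; NOT taken from the study.
american = (80, 20)  # reference category
pizzeria = (60, 40)

ratio = crude_odds_ratio(pizzeria, american)
print(f"Crude odds ratio vs. reference: {ratio:.3f}")
```

A ratio below 1 means lower odds of the highest grade than the reference category, which is the pattern the quote describes for several restaurant types.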
Now you know that, when starting a restaurant business, you should keep a healthy growth strategy, control your operating expenses, reduce your debt financing and (at least in New York) stay away from opening a pizzeria.
Let’s look at another industry: the giant telecommunications industry.
Like the banking industry, the telecom industry is one that is constantly concerned with churn rates. In other words, the key question every manager in the telecom industry asks themselves is: “How can I predict whether or not the customer that’s buying my service will remain with the company?”
That’s exactly what B. E. A. Oghojafor, G. C. Mesike, C. I. Omoera and R. D. Bakare asked themselves, and they came up with a logistic regression model which predicts customer churn rate based on socio-cultural factors in their hometown of Lagos, Nigeria.
Their study led to this result:
“Call expenses, providers’ advertisement medium, type of service plan, number of mobile connections and providers’ service facilities developed in the survey scale of this study are reliable indicators of likelihood of customers’ attrition and can be a training guideline for telecom service providers in Nigeria.”
In other words, having identified the previous variables, they were able to use logistic regression to predict the likelihood of customer attrition. Interestingly enough, their study also concludes that although socio-cultural factors don’t directly affect churn rates, marital status did have a lower odds ratio than the rest of the variables.
On another continent, two data scientists conducted a similar analysis. Helen Treasa Sebastian and Rupali Wa also wanted to figure out the primary factors behind telecom churn. In their analysis, they looked into whether the length of time a user has been with a telecom service has any effect on how likely they are to churn. They ran the data and found:
“…it is clearly stated that from a range of 0-30 months are the people who are most likely to churn and 30-60 months most likely not and anything above 60 months are customers who would ideally not churn.”
They were able to predict this with an 80.02% success rate, which for the telecom industry is quite the advantage.
Taking those case studies into consideration, it’s evident just how powerful logistic regression can be. Whether it’s to simply keep your restaurant full, or to bump the bottom line of a massive telecom company, there’s seemingly no challenge that logistic regression can’t handle. With clean data, that is.
Do you want to know more about how to use logistic regression in real-world scenarios? Check out our course, where our expert instructors guide you through more real-world use cases of this fantastic tool we call logistic regression.