[Day 11] Supervised Machine Learning Type 2 - Logistic Regression (with a Small Python Project)
Can AI predict if a patient has diabetes? With Logistic Regression, it can. Dive into one of ML’s simplest and smartest tools for classification.
What is Logistic Regression?
Logistic Regression is a widely used supervised learning algorithm for predicting categorical outcomes (often binary). It is used to model the probability of a particular class or event occurring. Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts the probability that a given input belongs to a particular class.
The output of Logistic Regression is a probability that lies between 0 and 1, which is transformed into binary class predictions based on a threshold (usually 0.5).
Use Cases:
- Predicting whether an email is spam or not.
- Classifying whether a customer will buy a product or not (1 for yes, 0 for no).
- Diagnosing diseases based on patient features.
The answer is always one of two classes, such as 'Yes' or 'No'.
How Does Logistic Regression Work?
Logistic Function (Sigmoid Curve):

Sigmoid Function:
p = 1 / (1 + e^(-z))
Where:
- p: Probability of the target class being 1.
- z: Linear combination of the input features.
Logistic Regression Equation (Single Feature):
p = 1 / (1 + e^(-(w0 + w1 * x)))
Where:
- p: Probability of the target class being 1.
- x: Input feature.
- w0: Intercept (bias term).
- w1: Weight (coefficient) for the feature.
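To make this concrete, here is a minimal Python sketch of the sigmoid and the 0.5 threshold described above (the values of w0, w1, and x are made up purely for illustration):

import numpy as np

def sigmoid(z):
    # Maps any real-valued z to a probability between 0 and 1
    return 1 / (1 + np.exp(-z))

# Hypothetical learned parameters for a single-feature model
w0, w1 = -4.0, 0.05  # illustrative values only

x = 120  # example input feature
p = sigmoid(w0 + w1 * x)  # p = 1 / (1 + e^(-(w0 + w1 * x)))
prediction = 1 if p >= 0.5 else 0  # apply the usual 0.5 threshold

print(f"Probability: {p:.3f}, Predicted class: {prediction}")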
Advantages of Logistic Regression:
- Simple and interpretable: It's easy to understand how the model works.
- Computationally efficient: It can be applied to large datasets.
- Works well with linearly separable data.
Limitations of Logistic Regression:
- Non-linear relationships: Logistic Regression struggles when the relationship between features and the target variable is non-linear.
- Sensitive to outliers: It is affected by outliers in the dataset, which can distort predictions.
- Assumes a linear decision boundary: It may not perform well if the data cannot be separated linearly.
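As a quick illustration of that last limitation, here is a small sketch (with toy data, not from the original article) showing Logistic Regression failing on XOR-style data, which no straight line can separate:

import numpy as np
from sklearn.linear_model import LogisticRegression

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])  # XOR labels: not linearly separable

clf = LogisticRegression().fit(X_xor, y_xor)
# A linear boundary can classify at most 3 of these 4 points correctly
print("Training accuracy:", clf.score(X_xor, y_xor))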
Real-world use cases:
1. Medical Diagnosis (e.g., Predicting Disease Outcome)
- Problem: Logistic Regression can be used to predict whether a patient is likely to have a particular disease (e.g., diabetes, cancer) based on their medical history and health metrics (e.g., blood pressure, age, BMI).
- Example: Predicting whether a patient has diabetes or not based on features like glucose level, BMI, age, and insulin resistance.
- Why Logistic Regression: The outcome is binary (disease vs. no disease), and the relationship between the features and the outcome is often linear or approximately linear, making Logistic Regression a good choice.
Practical Example:
- A healthcare provider can use Logistic Regression to predict whether a patient will develop heart disease based on factors like cholesterol levels, smoking habits, age, and exercise routine.
2. Email Spam Detection
- Problem: Logistic Regression can classify emails as either spam or non-spam (ham) based on certain features (e.g., words, frequency of specific keywords).
- Example: An email service like Gmail or Outlook uses Logistic Regression to identify spam emails and filter them out.
- Why Logistic Regression: The model works well for binary classification problems (spam or not spam) and can efficiently handle textual data when combined with techniques like TF-IDF or bag of words.
Practical Example:
- Email providers analyze text features such as frequency of certain words (like "win", "free", etc.) to classify emails. Logistic Regression helps to classify new incoming emails as spam based on this training data.
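As a rough sketch of this idea (using a tiny made-up corpus rather than real email data), TF-IDF features can be fed directly into Logistic Regression:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus; a real spam filter trains on far more emails
emails = ["win a free prize now", "free money, claim your win",
          "meeting agenda for tomorrow", "project update and notes"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(emails)  # convert text to TF-IDF features

clf = LogisticRegression().fit(X_text, labels)
print(clf.predict(vectorizer.transform(["claim your free prize"])))  # likely [1]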
3. Customer Churn Prediction
- Problem: Companies, particularly in telecommunications and subscription services, use Logistic Regression to predict whether a customer is likely to churn (i.e., cancel a subscription or stop using the service).
- Example: A telecom company can predict whether a customer will leave based on features like usage patterns, payment history, and customer service interactions.
- Why Logistic Regression: The outcome is binary (churn or not churn), and it allows companies to proactively offer retention strategies for customers who are likely to leave.
Practical Example:
- A subscription service like Netflix or a telecom provider uses Logistic Regression to predict customer churn based on features like average viewing time, payment failures, and customer service issues.
4. Credit Scoring (Loan Approval)
- Problem: Financial institutions use Logistic Regression to predict the likelihood of a loan applicant defaulting on their loan based on features like credit score, income, employment status, and loan amount.
- Example: Banks can use Logistic Regression to decide whether to approve a loan application or not.
- Why Logistic Regression: It is a binary classification problem (approve or deny), and the relationships between financial indicators and the likelihood of default are often linear.
Practical Example:
- A bank can predict whether a person is likely to default on a loan by analyzing their credit score, employment history, and past financial behavior.
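Because lenders usually care about the probability of default rather than a hard 0/1 label, predict_proba is often more useful than predict here. A minimal sketch with made-up applicant data (the features, values, and 0.3 cutoff below are illustrative assumptions, not real lending criteria):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up applicant data for illustration only
data = pd.DataFrame({
    'credit_score': [580, 700, 640, 760, 610, 720],
    'income':       [30, 80, 45, 95, 35, 70],    # in $1000s
    'loan_amount':  [20, 10, 25, 15, 30, 12],    # in $1000s
    'default':      [1, 0, 1, 0, 1, 0],          # 1 = defaulted
})

model = LogisticRegression(max_iter=1000)
model.fit(data.drop('default', axis=1), data['default'])

applicant = pd.DataFrame({'credit_score': [660], 'income': [55], 'loan_amount': [18]})
p_default = model.predict_proba(applicant)[0, 1]  # probability of class 1 (default)
print("Approve" if p_default < 0.3 else "Deny")   # bank-chosen risk cutoff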
5. Marketing Campaign Effectiveness
- Problem: Logistic Regression can help companies determine the likelihood that a customer will respond positively to a marketing campaign, such as purchasing a product or subscribing to a service, based on features like previous purchase behavior, demographics, and response to past campaigns.
- Example: An e-commerce company can use Logistic Regression to predict whether a customer will click on a promotional email or whether they will make a purchase after viewing an ad.
- Why Logistic Regression: The outcome is binary (response/no response), and the model works well with marketing features such as past purchases, email click-through rates, and demographic data.
Practical Example:
- An online retailer might use Logistic Regression to predict whether a customer is likely to click on a promotional email based on past behavior such as purchase history, site visits, and email engagement.
Quick Python Project
A Python project using Logistic Regression to predict whether a person has diabetes based on certain health features.
Data Set:
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|
6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
3 | 78 | 50 | 32 | 88 | 28.3 | 0.168 | 26 | 0 |
10 | 115 | 75 | 25 | 0 | 31.7 | 0.472 | 34 | 1 |
7 | 109 | 64 | 23 | 132 | 34.9 | 0.261 | 38 | 1 |
4 | 140 | 70 | 34 | 0 | 35.6 | 0.451 | 45 | 1 |
Save this data as diabetes.csv
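If you'd rather create the file from code, this snippet writes the same ten rows to diabetes.csv:

import pandas as pd

cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
        'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
    [5, 116, 74, 0, 0, 25.6, 0.201, 30, 0],
    [3, 78, 50, 32, 88, 28.3, 0.168, 26, 0],
    [10, 115, 75, 25, 0, 31.7, 0.472, 34, 1],
    [7, 109, 64, 23, 132, 34.9, 0.261, 38, 1],
    [4, 140, 70, 34, 0, 35.6, 0.451, 45, 1],
]

# Write the table above to a CSV file in the current directory
pd.DataFrame(rows, columns=cols).to_csv('diabetes.csv', index=False)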
Solution:
import pandas as pd
from sklearn.linear_model import LogisticRegression
# Load the dataset (Ensure 'diabetes.csv' is in the current directory)
df = pd.read_csv('diabetes.csv')
# Separate the features and the target variable
X = df.drop('Outcome', axis=1) # Features (input variables)
y = df['Outcome'] # Target variable (1 for diabetes, 0 for no diabetes)
# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=200)
# Train the model using the dataset
model.fit(X, y)
# Example of predicting the outcome for a new input (for a new patient)
new_data = pd.DataFrame({
'Pregnancies': [1],
'Glucose': [90],
'BloodPressure': [70],
'SkinThickness': [25],
'Insulin': [0],
'BMI': [30.5],
'DiabetesPedigreeFunction': [0.201],
'Age': [29]
})
# Make a prediction for the new data
prediction = model.predict(new_data)
# Output the predicted class (0: No Diabetes, 1: Diabetes)
print("Prediction (0: No Diabetes, 1: Diabetes):", prediction[0])
Let’s make predictions:
Sample Input:
If the new data is:
Pregnancies: 1
Glucose: 90
BloodPressure: 70
SkinThickness: 25
Insulin: 0
BMI: 30.5
DiabetesPedigreeFunction: 0.201
Age: 29
Sample Output:
Prediction (0: No Diabetes, 1: Diabetes): 0
This means the model predicts the patient does not have diabetes based on the input features.
Note: This example trains a Logistic Regression model on a small sample dataset and uses it for predictions. You can keep expanding the dataset with more rows to improve the model's generalizability.
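Once you have more rows, a sensible next step is to hold out part of the data and measure accuracy on examples the model has not seen. A minimal sketch, reusing the X, y, and LogisticRegression import from the script above:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the rows to evaluate the model on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))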
Here’s a step-by-step explanation of the code:
1. Import Required Libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
- pandas: Used for data manipulation and creating DataFrames (structured data tables).
- LogisticRegression: A machine learning algorithm from scikit-learn for classification tasks.
2. Load the Dataset
df = pd.read_csv('diabetes.csv')
- This assumes the dataset (diabetes.csv) is in the current directory. pd.read_csv reads the file and loads it as a DataFrame into the variable df.
- Dataset Example: The dataset contains information about patients, with features like Glucose, BMI, etc., and a target variable Outcome indicating diabetes (1) or no diabetes (0).
3. Separate Features and Target Variable
X = df.drop('Outcome', axis=1) # Features (input variables)
y = df['Outcome'] # Target variable (1 for diabetes, 0 for no diabetes)
- X contains the independent variables (features) like Pregnancies, Glucose, etc., used to predict the outcome.
- y is the dependent variable (Outcome) we are trying to predict: 1 for diabetes, 0 for no diabetes.
4. Initialize the Logistic Regression Model
model = LogisticRegression(max_iter=200)
- LogisticRegression: Used for binary classification (two categories: diabetes/no diabetes).
- max_iter=200: Sets the maximum number of iterations the optimization algorithm may take to converge.
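If the solver does not converge within max_iter, scaling the features usually helps more than simply raising the iteration count. One common pattern (a sketch, not part of the original script) wraps the model in a scikit-learn Pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardize each feature to zero mean / unit variance before fitting;
# this typically lets the optimizer converge in far fewer iterations
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X, y)  # X and y as defined in the script above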
5. Train the Model
model.fit(X, y)
- The .fit() method trains the model on the dataset.
- During this step, the logistic regression model learns the relationship between the features (X) and the target variable (y).
6. Prepare New Data for Prediction
new_data = pd.DataFrame({
'Pregnancies': [1],
'Glucose': [90],
'BloodPressure': [70],
'SkinThickness': [25],
'Insulin': [0],
'BMI': [30.5],
'DiabetesPedigreeFunction': [0.201],
'Age': [29]
})
- This is a new patient's data we want to predict for.
- The values correspond to the features in the dataset (e.g., Pregnancies, Glucose, etc.).
- The data is passed as a dictionary and converted to a DataFrame using pd.DataFrame.
7. Make a Prediction
prediction = model.predict(new_data)
- model.predict() uses the trained model to predict the outcome for new_data.
- It returns 1 (diabetes) or 0 (no diabetes).
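If you want the underlying probability (the sigmoid output discussed earlier) rather than the hard label, scikit-learn exposes it via predict_proba:

# Column 0 is P(no diabetes), column 1 is P(diabetes)
proba = model.predict_proba(new_data)
print("P(diabetes):", proba[0, 1])  # predict() returns 1 when this is >= 0.5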
8. Output the Prediction
print("Prediction (0: No Diabetes, 1: Diabetes):", prediction[0])
- The prediction is printed: 0 indicates no diabetes, 1 indicates diabetes.
Key Points
- Logistic Regression is a supervised learning algorithm used for binary classification.
- The dataset must have the same structure during training and prediction.
- Input values (features) for prediction should match the trained model's features.
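That last point can be checked in code: a fitted scikit-learn model (version 1.0+, trained on a DataFrame) records the exact columns it expects:

# Columns the model was trained on, in order; new inputs must match these
print(model.feature_names_in_)
# ['Pregnancies' 'Glucose' 'BloodPressure' 'SkinThickness' 'Insulin'
#  'BMI' 'DiabetesPedigreeFunction' 'Age']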