[Day 17] Supervised Machine Learning Type 8 - Gradient Boosting Machines (GBM) (with a Small Python Project)
Failed your first test? Learn from it! That's exactly how Gradient Boosting works. Discover how machines ace predictions just like you would! 💡📊
Imagine you are preparing for a math exam. You take a mock test to see how well you perform.
- Your first score? 50/100.
- Not great, but now you know where you struggle!
Instead of giving up, you analyze your mistakes:
✅ You did well in geometry
❌ You struggled with algebra
What do you do? You don't study everything from scratch again. Instead:
- You focus more on algebra, since thatβs where you lost the most marks.
- You take another test, and your score improves to 70/100.
- You repeat this process until you consistently score 95+.
This step-by-step learning process is exactly how Gradient Boosting Machines (GBM) work in machine learning.
What is Gradient Boosting?
Gradient Boosting is a machine learning algorithm that improves predictions step by step by focusing on errors. It builds multiple weak models (small decision trees) and learns from mistakes over time.
How GBM Works (Step by Step)
Step 1: Initial Guess (First Tree)
- The model makes a rough prediction (like your first mock test score).
- Letβs say we are predicting house prices.
- The model predicts $300,000 for a house, but the actual price is $350,000.
- Error (Residual) = $350,000 - $300,000 = $50,000.
Step 2: Learn from Mistakes
- The model builds another small tree to predict the error ($50,000).
- Instead of fully correcting the mistake, it applies a small adjustment (controlled by the learning rate). For example, with a learning rate of 0.1, the correction is 0.1 × $50,000 = $5,000, moving the prediction to $305,000.
Step 3: Repeat Until Errors are Small
- Each new tree tries to correct the errors of the previous one.
- After many iterations, the final prediction becomes very accurate.
🎯 Final Result: A strong model that combines multiple weak models, just like taking multiple tests and improving each time.
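To see these steps in code, here is a minimal sketch of the boosting loop for the house-price example: start from an average guess, then repeatedly fit a small tree to the residuals and apply only a fraction of its correction. The data values are invented for illustration.

```python
# A minimal sketch of gradient boosting for regression (squared loss):
# each small tree is fit to the current residuals, and only a fraction
# of its correction (the learning rate) is applied. Toy data, made up.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1200], [1500], [1800], [2100], [2400], [3000]])        # house size (sq ft)
y = np.array([250_000, 300_000, 330_000, 360_000, 400_000, 450_000])  # price ($)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())  # Step 1: rough initial guess

for _ in range(100):                    # Steps 2-3: learn from mistakes, repeat
    residuals = y - prediction          # where is the model still wrong?
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # small adjustment

print(np.round(prediction))  # very close to y after many small corrections
```

Running it shows the predictions converging toward the true prices, which is exactly the mock-test loop from the analogy.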
🌍 Real-World Use Cases of GBM
GBM is used in various industries due to its high accuracy and ability to handle complex data.
1. Banking & Finance 🏦 - Credit Risk Scoring
📌 Problem: Banks need to decide whether to approve or reject a loan based on a customer's credit history.
🎯 GBM Solution:
- Predicts loan default risk by analyzing financial behavior.
- Uses customer data (credit score, income, debt-to-income ratio) to classify applicants as low-risk or high-risk borrowers.
- Many financial institutions, including JP Morgan, Wells Fargo, and Capital One, use GBM for fraud detection & credit scoring.
🔹 Why GBM? It handles noisy data well, and modern implementations also handle missing values, making it ideal for messy real-world financial datasets.
2. Healthcare 🏥 - Disease Diagnosis & Prediction
📌 Problem: Doctors need to predict the likelihood of a patient having a disease.
🎯 GBM Solution:
- GBM models are used for early detection of diseases like cancer, diabetes, and heart disease.
- Trained on patient medical history, test results, and lifestyle factors to predict risk levels.
- Used in predictive healthcare systems at hospitals & medical research labs.
🔹 Example: IBM Watson Health uses GBM for predicting hospital readmission rates.
3. E-commerce & Retail 🛒 - Customer Churn Prediction
📌 Problem: Online businesses need to identify customers likely to stop purchasing.
🎯 GBM Solution:
- Predicts which customers are about to churn (stop using a service).
- Uses purchase history, website activity, and customer service interactions to identify at-risk customers.
- Helps businesses like Amazon, Flipkart, and Shopify proactively offer discounts or loyalty rewards.
🔹 Why GBM? It accurately captures complex customer behaviors, leading to better retention strategies.
💻 Python Mini Project - JPMorgan-Style Fraud Detection
Now, let's build a JPMorgan-style fraud detection model with GBM, trained on a small simulated dataset.
🔹 Dataset Overview
We simulate a dataset of 20 credit card transactions. The dataset contains:
- Transaction Amount
- Merchant Category
- Time of Transaction
- Transaction Location
- Previous Fraud History
- Fraudulent Transaction (Target: 0 = No, 1 = Yes)
🔹 Save this table as fraud_data.csv (or generate it with the snippet after the table)
Here is your fraud transactions dataset in tabular format:
Transaction ID | Amount ($) | Merchant Category | Time (Hour) | Location | Previous Fraud | Fraud (0/1) |
---|---|---|---|---|---|---|
1 | 20 | Grocery Store | 10 | New York | 0 | 0 |
2 | 5000 | Electronics | 23 | Miami | 1 | 1 |
3 | 75 | Restaurants | 20 | Chicago | 0 | 0 |
4 | 200 | Online Shopping | 2 | LA | 0 | 1 |
5 | 1500 | Jewelry | 22 | Vegas | 1 | 1 |
6 | 10 | Coffee Shop | 9 | Boston | 0 | 0 |
7 | 350 | Travel Booking | 5 | Dallas | 0 | 0 |
8 | 4500 | Electronics | 1 | NYC | 1 | 1 |
9 | 95 | Gas Station | 14 | SF | 0 | 0 |
10 | 2200 | Jewelry | 23 | Miami | 1 | 1 |
11 | 45 | Clothing | 12 | Seattle | 0 | 0 |
12 | 3200 | Electronics | 3 | Austin | 1 | 1 |
13 | 80 | Entertainment | 18 | Denver | 0 | 0 |
14 | 180 | Pharmacy | 11 | Houston | 0 | 0 |
15 | 2500 | Luxury | 22 | NYC | 1 | 1 |
16 | 30 | Grocery Store | 15 | New York | 0 | 0 |
17 | 500 | Online Shopping | 7 | San Diego | 0 | 0 |
18 | 4000 | Electronics | 4 | LA | 1 | 1 |
19 | 60 | Restaurants | 19 | Boston | 0 | 0 |
20 | 2900 | Jewelry | 23 | Vegas | 1 | 1 |
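If you'd rather not create the CSV by hand, this snippet writes the exact table above to fraud_data.csv:

```python
# Writes the 20-row table above to fraud_data.csv (same columns, same values).
import pandas as pd
from io import StringIO

csv_text = """Transaction ID,Amount ($),Merchant Category,Time (Hour),Location,Previous Fraud,Fraud (0/1)
1,20,Grocery Store,10,New York,0,0
2,5000,Electronics,23,Miami,1,1
3,75,Restaurants,20,Chicago,0,0
4,200,Online Shopping,2,LA,0,1
5,1500,Jewelry,22,Vegas,1,1
6,10,Coffee Shop,9,Boston,0,0
7,350,Travel Booking,5,Dallas,0,0
8,4500,Electronics,1,NYC,1,1
9,95,Gas Station,14,SF,0,0
10,2200,Jewelry,23,Miami,1,1
11,45,Clothing,12,Seattle,0,0
12,3200,Electronics,3,Austin,1,1
13,80,Entertainment,18,Denver,0,0
14,180,Pharmacy,11,Houston,0,0
15,2500,Luxury,22,NYC,1,1
16,30,Grocery Store,15,New York,0,0
17,500,Online Shopping,7,San Diego,0,0
18,4000,Electronics,4,LA,1,1
19,60,Restaurants,19,Boston,0,0
20,2900,Jewelry,23,Vegas,1,1"""

pd.read_csv(StringIO(csv_text)).to_csv("fraud_data.csv", index=False)
```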
🔹 Python Code for GBM Model with Manual Input
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
# Step 1: Load Dataset
data = pd.read_csv("fraud_data.csv")
# Step 2: Define Features (X) and Target Variable (y)
X = data[['Amount ($)', 'Time (Hour)', 'Previous Fraud']]
y = data['Fraud (0/1)']
# Step 3: Split into Training & Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Train GBM Model
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
# Step 5: Model Accuracy
accuracy = accuracy_score(y_test, gbm.predict(X_test))
print(f"Model Accuracy: {accuracy:.2f}")
# Step 6: User Manual Input for Fraud Detection
print("\nEnter transaction details to check if it's fraudulent:")
amount = float(input("Enter transaction amount ($): "))
time = int(input("Enter transaction time (hour of the day, 0-23): "))
previous_fraud = int(input("Was the user previously involved in fraud? (1=Yes, 0=No): "))
# Create DataFrame for prediction
input_data = pd.DataFrame([[amount, time, previous_fraud]], columns=['Amount ($)', 'Time (Hour)', 'Previous Fraud'])
# Predict Fraud Probability
fraud_prediction = gbm.predict(input_data)[0]
fraud_probability = gbm.predict_proba(input_data)[0][1]
# Display Result
if fraud_prediction == 1:
    print(f"\n⚠️ ALERT: This transaction is likely **FRAUDULENT** with {fraud_probability*100:.2f}% probability!")
else:
    print(f"\n✅ This transaction is likely **SAFE**, with a fraud probability of {fraud_probability*100:.2f}%.")
My input & output (screenshot of a sample run): the transaction I checked manually was classified as 'safe'.
Step-by-Step Explanation of the Fraud Detection Model Using GBM:
📂 Step 1: Load Dataset
data = pd.read_csv("fraud_data.csv")
- Reads the dataset (`fraud_data.csv`) into a pandas DataFrame.
🧩 Step 2: Define Features (X) and Target (y)
X = data[['Amount ($)', 'Time (Hour)', 'Previous Fraud']]
y = data['Fraud (0/1)']
- X (Features): Transaction Amount, Time of Transaction, and Previous Fraud History.
- y (Target): Whether the transaction is fraudulent (`1`) or safe (`0`).
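A side note: the categorical columns (Merchant Category, Location) are left out because GradientBoostingClassifier needs numeric input. If you wanted to include them, one common option (not part of the original script) is one-hot encoding:

```python
# Optional: bring in the categorical columns via one-hot encoding
# (pd.get_dummies turns each category into its own 0/1 column).
X_full = pd.get_dummies(
    data[['Amount ($)', 'Time (Hour)', 'Previous Fraud',
          'Merchant Category', 'Location']],
    columns=['Merchant Category', 'Location']
)
```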
✂️ Step 3: Split Data into Training & Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- 80% of the data → used for training
- 20% of the data → used for testing
- `random_state=42` ensures consistent results each time.
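One optional refinement, not in the original script: with only 20 rows, the test split holds just 4 transactions, so stratify=y keeps the fraud/safe ratio the same in both splits:

```python
# Optional: a stratified split keeps the fraud/safe ratio equal in train and
# test, which matters when the whole dataset is only 20 rows (4 test rows).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```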
🤖 Step 4: Train the Gradient Boosting Model
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
- Creates a GBM model with:
  - 100 decision trees (`n_estimators=100`).
  - A learning rate of 0.1 (how much the model adjusts at each step).
  - A max tree depth of 3 to prevent overfitting.
- The model learns fraud patterns from the training data. (These values are a starting point; see the tuning sketch below.)
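Because these three hyperparameters interact, they are usually tuned rather than fixed. Here is a minimal sketch using scikit-learn's GridSearchCV; the grid values are illustrative assumptions, not recommendations:

```python
# Illustrative hyperparameter search over the three knobs discussed above.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid, cv=3, scoring="accuracy",
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
```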
📊 Step 5: Check Model Accuracy
accuracy = accuracy_score(y_test, gbm.predict(X_test))
print(f"Model Accuracy: {accuracy:.2f}")
- Compares predictions on test data with actual fraud labels.
- Prints the accuracy of the model (see the note below on why accuracy alone is a coarse measure here).
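With a 20-row dataset, the test set holds only 4 transactions, so a single accuracy number is noisy. Since classification_report is already imported at the top of the script, per-class precision and recall come almost for free:

```python
# classification_report is already imported in the script above.
print(classification_report(y_test, gbm.predict(X_test)))
```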
📝 Step 6: User Input for Fraud Detection
amount = float(input("Enter transaction amount ($): "))
time = int(input("Enter transaction time (hour of the day, 0-23): "))
previous_fraud = int(input("Was the user previously involved in fraud? (1=Yes, 0=No): "))
- Takes user input for a new transaction.
- Asks for:
  - Transaction Amount ($)
  - Time of transaction (0-23 hours)
  - Whether the user has a previous fraud history (`1=Yes, 0=No`).
🔄 Step 7: Create DataFrame for Prediction
input_data = pd.DataFrame([[amount, time, previous_fraud]], columns=['Amount ($)', 'Time (Hour)', 'Previous Fraud'])
- Converts the user input into a DataFrame, matching the format of the training data.
⚠️ Step 8: Predict Fraud Probability
fraud_prediction = gbm.predict(input_data)[0]
fraud_probability = gbm.predict_proba(input_data)[0][1]
- Predicts whether the transaction is fraud (`1`) or safe (`0`).
- Calculates the probability of fraud: `predict_proba` returns `[[P(safe), P(fraud)]]`, so `[0][1]` selects the fraud probability.
🚨 Step 9: Display the Result
if fraud_prediction == 1:
    print(f"\n⚠️ ALERT: This transaction is likely **FRAUDULENT** with {fraud_probability*100:.2f}% probability!")
else:
    print(f"\n✅ This transaction is likely **SAFE**, with a fraud probability of {fraud_probability*100:.2f}%.")
- If the prediction is 1 (fraud detected) → shows an ALERT with the fraud probability.
- If the prediction is 0 (safe transaction) → shows SAFE with the fraud probability.
💡 In a Nutshell
Gradient Boosting Machines (GBM) are powerful, accurate, and widely used in real-world applications. Whether predicting fraudulent transactions, diagnosing diseases, or forecasting stock prices, GBM remains one of the top choices for structured data problems.
✅ GBM learns step-by-step, improving at each stage.
✅ It is widely used in banking, healthcare, e-commerce, and cybersecurity.
✅ It is one of the best algorithms for structured data analysis.
✅ It is highly customizable and requires hyperparameter tuning.
📊 Supervised Learning Algorithms Comparison
Algorithm | Type | How It Works | Best Used When | Modern Use Cases |
---|---|---|---|---|
Linear Regression | Regression | Fits a straight line to minimize error between predicted and actual values. | You need a simple, interpretable model for predicting continuous values. | Predicting prices, forecasting sales, risk scoring. |
Logistic Regression | Classification | Estimates probability using the logistic function to classify outcomes. | Youβre classifying binary or multi-class targets with linear boundaries. | Spam detection, churn prediction, disease diagnosis. |
Decision Tree | Both | Splits data into branches based on decision rules (if-else) for prediction. | You want interpretable rules and can handle non-linear data. | Loan approval, fraud detection, rule-based workflows. |
Support Vector Machine (SVM) | Classification | Finds the best boundary (hyperplane) that separates classes with the widest margin. | You need high accuracy and clear margins between classes (even in small datasets). | Image classification, face detection, bioinformatics. |
K-Nearest Neighbors (KNN) | Both | Classifies based on the majority class among the K nearest neighbors. | You want simplicity and donβt need a model explanation (great for cold-start problems). | Recommender systems, customer segmentation, personalization. |
Naive Bayes | Classification | Applies Bayes' Theorem assuming independence between features. | You have categorical data and want fast, baseline classification. | Text classification, email filtering, sentiment analysis. |
Random Forest | Both | Builds multiple decision trees and averages their predictions (ensemble learning). | You need high performance, non-linear classification/regression with feature importance. | Credit scoring, e-commerce ranking, fraud detection. |
Gradient Boosting (GBM) | Both | Builds trees sequentially where each tree corrects the errors of the previous ones. | You want top performance and can handle slower training time and hyperparameter tuning. | Click-through rate prediction, stock trend prediction, customer lifetime value estimation. |
When to Use Which Algorithm?
- Regression Problems:
- Use Linear Regression for interpretable, linear relationships.
- Use Random Forest/GBM for non-linear, high-accuracy needs.
- Classification Problems:
- Use Logistic Regression for binary/multi-class linear problems.
- Use SVM for small-to-medium datasets with clear margins.
- Use Naive Bayes for text/NLP tasks (fast but simplistic).
- Use Random Forest/GBM for tabular data with complex patterns.
- Modern Trends:
- GBM variants (XGBoost, LightGBM, CatBoost) dominate structured data competitions (see the sketch after this list).
- Hybrid models (e.g., RF + SVM) are used in healthcare/biology.
- Deep Learning (CNNs/RNNs) replaces traditional methods for image/text/sequential data.
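These variants expose scikit-learn-compatible APIs, so swapping one into the fraud project is a small change. As one example, scikit-learn's own histogram-based estimator (inspired by LightGBM) is a near drop-in replacement for GradientBoostingClassifier; the X_train/y_train below are reused from the project above:

```python
# HistGradientBoostingClassifier: scikit-learn's fast, LightGBM-style booster.
# It bins features into histograms, trains much faster on large tables,
# and handles missing values natively.
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(max_iter=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```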
📬 Join the DecodeAI WhatsApp Channel for regular AI updates → Click here