[Day 17] Supervised Machine Learning Type 8 - Gradient Boosting Machines (GBM) (with a Small Python Project)

Failed your first test? Learn from it! That's exactly how Gradient Boosting works. Discover how machines ace predictions just like you would! 💡📊

Imagine you are preparing for a math exam. You take a mock test to see how well you perform.

  • Your first score? 50/100.
  • Not great, but now you know where you struggle!

Instead of giving up, you analyze your mistakes:

✅ You did well in geometry
❌ You struggled with algebra

What do you do? You don't study everything from scratch again. Instead:

  1. You focus more on algebra, since that's where you lost the most marks.
  2. You take another test, and your score improves to 70/100.
  3. You repeat this process until you consistently score 95+.

This step-by-step learning process is exactly how Gradient Boosting Machines (GBM) work in machine learning.


What is Gradient Boosting?

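Gradient Boosting is a machine learning algorithm that improves predictions step by step by focusing on errors. It builds a sequence of weak models (usually small decision trees), each one trained to correct the mistakes left by the models before it, and adds their outputs together into one prediction.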

How GBM Works (Step by Step)

Step 1: Initial Guess (First Tree)

  • The model makes a rough prediction (like your first mock test score).
  • Let's say we are predicting house prices.
  • The model predicts $300,000 for a house, but the actual price is $350,000.
  • Error (Residual) = $350,000 - $300,000 = $50,000.

Step 2: Learn from Mistakes

  • The model builds another small tree to predict the error ($50,000).
  • Instead of fully correcting the mistake, it applies a small adjustment (controlled by the learning rate).
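
As a worked example of the update in Step 2, assume a learning rate of 0.1 and, purely for simplicity, that the second tree predicts the $50,000 residual exactly. Both are illustrative assumptions, not values from a fitted model. Here F_0 is the initial prediction, h_1 is the second tree's output, ν is the learning rate, and r is the residual:

```latex
\begin{aligned}
F_0(x) &= 300{,}000 \\
r_1 &= y - F_0(x) = 350{,}000 - 300{,}000 = 50{,}000 \\
F_1(x) &= F_0(x) + \nu \, h_1(x) = 300{,}000 + 0.1 \times 50{,}000 = 305{,}000 \\
r_2 &= y - F_1(x) = 350{,}000 - 305{,}000 = 45{,}000
\end{aligned}
```

The next tree would then be trained on the remaining $45,000 error, so each round closes only part of the gap.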

Step 3: Repeat Until Errors are Small

  • Each new tree tries to correct the errors left by all the trees before it.
  • After many iterations, the final prediction becomes very accurate.

🎯 Final Result: A strong model that combines multiple weak models, just like taking multiple tests and improving each time.
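
The three steps above can be sketched in a few lines of Python. This is a toy version for regression with squared-error loss (where the residual is exactly what each tree should learn), not how scikit-learn implements it internally. The function names, the synthetic house-size data, and the settings (50 trees, depth 2, learning rate 0.1) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit a toy gradient-boosting regressor for squared-error loss."""
    # Step 1: the initial guess is a constant, here the mean of the targets.
    base_prediction = float(np.mean(y))
    prediction = np.full(len(y), base_prediction)
    trees = []
    for _ in range(n_trees):
        # Step 2: residuals are what the current model still gets wrong.
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # Apply only a fraction of the correction (the learning rate).
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)  # Step 3: repeat with the updated prediction.
    return base_prediction, trees


def predict_boosted_trees(X, base_prediction, trees, learning_rate=0.1):
    """Sum the initial guess and every tree's scaled correction.

    The learning rate must match the one used during fitting.
    """
    prediction = np.full(X.shape[0], base_prediction)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction


# Synthetic data, made up for illustration: price grows with house size.
rng = np.random.default_rng(0)
size_sqft = rng.uniform(500, 3500, size=200)
price = 100_000 + 80 * size_sqft + rng.normal(0, 10_000, size=200)
X = size_sqft.reshape(-1, 1)

base, trees = fit_boosted_trees(X, price)
mse_before = np.mean((price - base) ** 2)
mse_after = np.mean((price - predict_boosted_trees(X, base, trees)) ** 2)
print(f"Training MSE with the initial guess only: {mse_before:,.0f}")
print(f"Training MSE after {len(trees)} trees: {mse_after:,.0f}")
```

The training error shrinks as trees are added because every tree is fitted to whatever error is still left over.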

🌍 Real-World Use Cases of GBM

GBM is used in various industries due to its high accuracy and ability to handle complex data.

1. Banking & Finance 🏦 – Credit Risk Scoring

📌 Problem: Banks need to decide whether to approve or reject a loan based on a customer's credit history.

🎯 GBM Solution:

  • Predicts loan default risk by analyzing financial behavior.
  • Uses customer data (credit score, income, debt-to-income ratio) to classify applicants as low-risk or high-risk borrowers.
  • Many financial institutions, including JP Morgan, Wells Fargo, and Capital One, use GBM for fraud detection & credit scoring.

🔹 Why GBM? It copes well with noisy data, and several implementations (for example XGBoost and LightGBM) handle missing values natively, making it a good fit for real-world financial datasets.


2. Healthcare 🏥 – Disease Diagnosis & Prediction

📌 Problem: Doctors need to predict the likelihood of a patient having a disease.

🎯 GBM Solution:

  • GBM models are used for early detection of diseases like cancer, diabetes, and heart disease.
  • Trained on patient medical history, test results, and lifestyle factors to predict risk levels.
  • Used in predictive healthcare systems at hospitals & medical research labs.

🔹 Example: IBM Watson Health uses GBM for predicting hospital readmission rates.


3. E-commerce & Retail 🛒 – Customer Churn Prediction

📌 Problem: Online businesses need to identify customers likely to stop purchasing.

🎯 GBM Solution:

  • Predicts which customers are about to churn (stop using a service).
  • Uses purchase history, website activity, and customer service interactions to identify at-risk customers.
  • Helps businesses like Amazon, Flipkart, and Shopify to proactively offer discounts or loyalty rewards.

🔹 Why GBM? It accurately captures complex customer behaviors, leading to better retention strategies.


💻 Python Mini Project – JPMorgan Fraud Detection

Now, let's build a small fraud detection model using GBM, modeled on the kind of problem a bank such as JPMorgan faces.

🔹 Dataset Overview

We simulate a dataset of 20 credit card transactions. The dataset contains:

  • Transaction Amount
  • Merchant Category
  • Time of Transaction
  • Transaction Location
  • Previous Fraud History
  • Fraudulent Transaction (Target: 0 = No, 1 = Yes)

🔹 Save this table as fraud_data.csv

Here is your fraud transactions dataset in tabular format:

| Transaction ID | Amount ($) | Merchant Category | Time (Hour) | Location | Previous Fraud | Fraud (0/1) |
|---|---|---|---|---|---|---|
| 1 | 20 | Grocery Store | 10 | New York | 0 | 0 |
| 2 | 5000 | Electronics | 23 | Miami | 1 | 1 |
| 3 | 75 | Restaurants | 20 | Chicago | 0 | 0 |
| 4 | 200 | Online Shopping | 2 | LA | 0 | 1 |
| 5 | 1500 | Jewelry | 22 | Vegas | 1 | 1 |
| 6 | 10 | Coffee Shop | 9 | Boston | 0 | 0 |
| 7 | 350 | Travel Booking | 5 | Dallas | 0 | 0 |
| 8 | 4500 | Electronics | 1 | NYC | 1 | 1 |
| 9 | 95 | Gas Station | 14 | SF | 0 | 0 |
| 10 | 2200 | Jewelry | 23 | Miami | 1 | 1 |
| 11 | 45 | Clothing | 12 | Seattle | 0 | 0 |
| 12 | 3200 | Electronics | 3 | Austin | 1 | 1 |
| 13 | 80 | Entertainment | 18 | Denver | 0 | 0 |
| 14 | 180 | Pharmacy | 11 | Houston | 0 | 0 |
| 15 | 2500 | Luxury | 22 | NYC | 1 | 1 |
| 16 | 30 | Grocery Store | 15 | New York | 0 | 0 |
| 17 | 500 | Online Shopping | 7 | San Diego | 0 | 0 |
| 18 | 4000 | Electronics | 4 | LA | 1 | 1 |
| 19 | 60 | Restaurants | 19 | Boston | 0 | 0 |
| 20 | 2900 | Jewelry | 23 | Vegas | 1 | 1 |
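
If you would rather not type the table by hand, here is a small sketch that builds the same 20 rows with pandas and writes them to fraud_data.csv. The helper names build_fraud_dataframe and save_fraud_csv are made up for this example; the column names match the ones the model code reads.

```python
import pandas as pd

COLUMNS = [
    "Transaction ID", "Amount ($)", "Merchant Category", "Time (Hour)",
    "Location", "Previous Fraud", "Fraud (0/1)",
]

# The 20 simulated transactions from the table, one tuple per row.
ROWS = [
    (1, 20, "Grocery Store", 10, "New York", 0, 0),
    (2, 5000, "Electronics", 23, "Miami", 1, 1),
    (3, 75, "Restaurants", 20, "Chicago", 0, 0),
    (4, 200, "Online Shopping", 2, "LA", 0, 1),
    (5, 1500, "Jewelry", 22, "Vegas", 1, 1),
    (6, 10, "Coffee Shop", 9, "Boston", 0, 0),
    (7, 350, "Travel Booking", 5, "Dallas", 0, 0),
    (8, 4500, "Electronics", 1, "NYC", 1, 1),
    (9, 95, "Gas Station", 14, "SF", 0, 0),
    (10, 2200, "Jewelry", 23, "Miami", 1, 1),
    (11, 45, "Clothing", 12, "Seattle", 0, 0),
    (12, 3200, "Electronics", 3, "Austin", 1, 1),
    (13, 80, "Entertainment", 18, "Denver", 0, 0),
    (14, 180, "Pharmacy", 11, "Houston", 0, 0),
    (15, 2500, "Luxury", 22, "NYC", 1, 1),
    (16, 30, "Grocery Store", 15, "New York", 0, 0),
    (17, 500, "Online Shopping", 7, "San Diego", 0, 0),
    (18, 4000, "Electronics", 4, "LA", 1, 1),
    (19, 60, "Restaurants", 19, "Boston", 0, 0),
    (20, 2900, "Jewelry", 23, "Vegas", 1, 1),
]


def build_fraud_dataframe():
    """Return the simulated transactions as a DataFrame."""
    return pd.DataFrame(ROWS, columns=COLUMNS)


def save_fraud_csv(path="fraud_data.csv"):
    """Write the dataset to a CSV file and return the DataFrame."""
    df = build_fraud_dataframe()
    df.to_csv(path, index=False)  # index=False keeps the row index out of the file
    return df


if __name__ == "__main__":
    saved = save_fraud_csv()
    print(f"Saved {len(saved)} rows to fraud_data.csv")
```

Run it once in the same folder as the model script so that pd.read_csv("fraud_data.csv") can find the file.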

πŸ”Ή Python Code for GBM Model with Manual Input

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Load Dataset
data = pd.read_csv("fraud_data.csv")

# Step 2: Define Features (X) and Target Variable (y)
X = data[['Amount ($)', 'Time (Hour)', 'Previous Fraud']]
y = data['Fraud (0/1)']

# Step 3: Split into Training & Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train GBM Model
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)

# Step 5: Model Accuracy
accuracy = accuracy_score(y_test, gbm.predict(X_test))
print(f"Model Accuracy: {accuracy:.2f}")

# Step 6: User Manual Input for Fraud Detection
print("\nEnter transaction details to check if it's fraudulent:")
amount = float(input("Enter transaction amount ($): "))
time = int(input("Enter transaction time (hour of the day, 0-23): "))
previous_fraud = int(input("Was the user previously involved in fraud? (1=Yes, 0=No): "))

# Create DataFrame for prediction
input_data = pd.DataFrame([[amount, time, previous_fraud]], columns=['Amount ($)', 'Time (Hour)', 'Previous Fraud'])

# Predict Fraud Probability
fraud_prediction = gbm.predict(input_data)[0]
fraud_probability = gbm.predict_proba(input_data)[0][1]

# Display Result
if fraud_prediction == 1:
    print(f"\n⚠️ ALERT: This transaction is likely **FRAUDULENT** with {fraud_probability*100:.2f}% probability!")
else:
    print(f"\n✅ This transaction is likely **SAFE**, with a fraud probability of {fraud_probability*100:.2f}%.")

My input & output: the transaction I checked manually was classified as 'safe'.

Step-by-Step Explanation of the Fraud Detection Model Using GBM:

🛠 Step 1: Load Dataset

data = pd.read_csv("fraud_data.csv")
  • Reads the dataset (fraud_data.csv) into a Pandas DataFrame.

📊 Step 2: Define Features (X) and Target (y)

X = data[['Amount ($)', 'Time (Hour)', 'Previous Fraud']]
y = data['Fraud (0/1)']
  • X (Features): Transaction Amount, Time of Transaction, and Previous Fraud History.
  • y (Target): Whether the transaction is fraudulent (1) or safe (0).

✂️ Step 3: Split Data into Training & Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  • 80% Data → Used for Training
  • 20% Data → Used for Testing
  • random_state=42 ensures consistent results each time.
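
With only 20 rows, a plain random split can leave the 4 test rows with very few (or no) fraud cases. One common option, shown here as a sketch and not as part of the original project, is to pass stratify=y so that both sets keep roughly the same fraud ratio. The three feature columns and the target are copied from the table so the snippet runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Feature and target columns copied from the 20-row table.
data = pd.DataFrame({
    "Amount ($)": [20, 5000, 75, 200, 1500, 10, 350, 4500, 95, 2200,
                   45, 3200, 80, 180, 2500, 30, 500, 4000, 60, 2900],
    "Time (Hour)": [10, 23, 20, 2, 22, 9, 5, 1, 14, 23,
                    12, 3, 18, 11, 22, 15, 7, 4, 19, 23],
    "Previous Fraud": [0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
                       0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
    "Fraud (0/1)": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1,
                    0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
})

X = data[["Amount ($)", "Time (Hour)", "Previous Fraud"]]
y = data["Fraud (0/1)"]

# stratify=y keeps the fraud / non-fraud proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training rows:", len(X_train), "| fraud cases:", int(y_train.sum()))
print("Testing rows:", len(X_test), "| fraud cases:", int(y_test.sum()))
```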

🤖 Step 4: Train the Gradient Boosting Model

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
  • Creates a GBM model with:
    • 100 decision trees (n_estimators=100).
    • A learning rate of 0.1 (how much the model adjusts at each step).
    • Max tree depth of 3 to prevent overfitting.
  • The model learns fraud patterns from training data.
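
To see the step-by-step learning happen, you can inspect the fitted model's train_score_ attribute, which stores the training loss after each tree is added. The sketch below uses the same hyperparameters as above but, as a simplifying assumption, fits on all 20 rows from the table instead of the training split.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Columns copied from the 20-row table.
amounts = [20, 5000, 75, 200, 1500, 10, 350, 4500, 95, 2200,
           45, 3200, 80, 180, 2500, 30, 500, 4000, 60, 2900]
hours = [10, 23, 20, 2, 22, 9, 5, 1, 14, 23,
         12, 3, 18, 11, 22, 15, 7, 4, 19, 23]
previous_fraud = [0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
                  0, 1, 0, 0, 1, 0, 0, 1, 0, 1]
fraud = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1,
         0, 1, 0, 0, 1, 0, 0, 1, 0, 1]

# One row per transaction: [amount, hour, previous fraud flag].
X = [list(row) for row in zip(amounts, hours, previous_fraud)]

gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gbm.fit(X, fraud)

# train_score_[i] is the training loss after tree i + 1 has been added.
for n_trees in (1, 10, 50, 100):
    loss = gbm.train_score_[n_trees - 1]
    print(f"Training loss after {n_trees:3d} trees: {loss:.4f}")
```

A smaller learning_rate makes this loss fall more slowly, which is why it is usually tuned together with n_estimators.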

📈 Step 5: Check Model Accuracy

accuracy = accuracy_score(y_test, gbm.predict(X_test))
print(f"Model Accuracy: {accuracy:.2f}")
  • Compares predictions on test data with actual fraud labels.
  • Prints the accuracy of the model.
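
Keep in mind that test_size=0.2 on 20 rows leaves only 4 transactions for testing, so one accuracy number can swing a lot. A sketch of one common alternative, stratified k-fold cross-validation, follows. The choice of 5 folds is an assumption, and the arrays repeat the three feature columns and the target from the table.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Columns: Amount ($), Time (Hour), Previous Fraud (copied from the table).
X = np.array([
    [20, 10, 0], [5000, 23, 1], [75, 20, 0], [200, 2, 0], [1500, 22, 1],
    [10, 9, 0], [350, 5, 0], [4500, 1, 1], [95, 14, 0], [2200, 23, 1],
    [45, 12, 0], [3200, 3, 1], [80, 18, 0], [180, 11, 0], [2500, 22, 1],
    [30, 15, 0], [500, 7, 0], [4000, 4, 1], [60, 19, 0], [2900, 23, 1],
])
y = np.array([0, 1, 0, 1, 1, 0, 0, 1, 0, 1,
              0, 1, 0, 0, 1, 0, 0, 1, 0, 1])

gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)

# Each fold holds out 4 rows, so every transaction is tested exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(gbm, X, y, cv=cv, scoring="accuracy")

print("Accuracy per fold:", np.round(scores, 2))
print(f"Mean accuracy: {scores.mean():.2f}")
```

Averaging over folds gives a steadier estimate than a single 4-row test set, although 20 rows is still far too few to trust for real fraud detection.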

📝 Step 6: User Input for Fraud Detection

amount = float(input("Enter transaction amount ($): "))
time = int(input("Enter transaction time (hour of the day, 0-23): "))
previous_fraud = int(input("Was the user previously involved in fraud? (1=Yes, 0=No): "))
  • Takes user input for a new transaction.
  • Asks for:
    • Transaction Amount ($)
    • Time of transaction (0-23 hours)
    • Whether the user had previous fraud history (1=Yes, 0=No).

🔍 Step 7: Create DataFrame for Prediction

input_data = pd.DataFrame([[amount, time, previous_fraud]], columns=['Amount ($)', 'Time (Hour)', 'Previous Fraud'])
  • Converts the user input into a DataFrame, matching the format of the training data.

⚠️ Step 8: Predict Fraud Probability

fraud_prediction = gbm.predict(input_data)[0]
fraud_probability = gbm.predict_proba(input_data)[0][1]
  • Predicts whether the transaction is fraud (1) or safe (0).
  • Calculates the probability of fraud.
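
For a two-class model like this one, predict() is effectively predict_proba() with a 0.5 cut-off. If missing a fraud is costlier than a false alarm, you could apply your own, lower cut-off to the probability instead. In the sketch below, the 0.3 threshold and the two new transactions are hypothetical values chosen for illustration, and the model is fitted on all 20 rows from the table.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

features = ["Amount ($)", "Time (Hour)", "Previous Fraud"]

# Feature and target columns copied from the 20-row table.
train = pd.DataFrame({
    "Amount ($)": [20, 5000, 75, 200, 1500, 10, 350, 4500, 95, 2200,
                   45, 3200, 80, 180, 2500, 30, 500, 4000, 60, 2900],
    "Time (Hour)": [10, 23, 20, 2, 22, 9, 5, 1, 14, 23,
                    12, 3, 18, 11, 22, 15, 7, 4, 19, 23],
    "Previous Fraud": [0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
                       0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
    "Fraud (0/1)": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1,
                    0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
})

gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gbm.fit(train[features], train["Fraud (0/1)"])

# Two made-up transactions: a small daytime purchase and a large night-time one.
new_transactions = pd.DataFrame([[25, 13, 0], [4200, 2, 1]], columns=features)

# Column 1 of predict_proba is the probability of class 1 (fraud).
fraud_probability = gbm.predict_proba(new_transactions)[:, 1]
default_labels = gbm.predict(new_transactions)

# Hypothetical stricter rule: flag anything with at least 30% fraud probability.
threshold = 0.3
custom_labels = (fraud_probability >= threshold).astype(int)

for prob, default, custom in zip(fraud_probability, default_labels, custom_labels):
    print(f"P(fraud) = {prob:.2f} | predict() = {default} | threshold 0.3 = {custom}")
```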

🚨 Step 9: Display the Result

if fraud_prediction == 1:
    print(f"\n⚠️ ALERT: This transaction is likely **FRAUDULENT** with {fraud_probability*100:.2f}% probability!")
else:
    print(f"\n✅ This transaction is likely **SAFE**, with a fraud probability of {fraud_probability*100:.2f}%.")
  • If 1 (fraud detected) → Shows an ALERT with fraud probability.
  • If 0 (safe transaction) → Shows SAFE with fraud probability.

💡 Nutshell

Gradient Boosting Machines (GBM) are powerful, accurate, and widely used in real-world applications. Whether predicting fraudulent transactions, diagnosing diseases, or forecasting stock prices, GBM remains one of the top choices for structured data problems.

✅ GBM learns step-by-step, improving at each stage.
✅ It is widely used in banking, healthcare, e-commerce, and cybersecurity.
✅ It is one of the best algorithms for structured data analysis.
✅ It is highly customizable, but it needs careful hyperparameter tuning to perform well.

📊 Supervised Learning Algorithms Comparison

| Algorithm | Type | How It Works | Best Used When | Modern Use Cases |
|---|---|---|---|---|
| Linear Regression | Regression | Fits a straight line to minimize error between predicted and actual values. | You need a simple, interpretable model for predicting continuous values. | Predicting prices, forecasting sales, risk scoring. |
| Logistic Regression | Classification | Estimates probability using the logistic function to classify outcomes. | You're classifying binary or multi-class targets with linear boundaries. | Spam detection, churn prediction, disease diagnosis. |
| Decision Tree | Both | Splits data into branches based on decision rules (if-else) for prediction. | You want interpretable rules and can handle non-linear data. | Loan approval, fraud detection, rule-based workflows. |
| Support Vector Machine (SVM) | Classification | Finds the best boundary (hyperplane) that separates classes with the widest margin. | You need high accuracy and clear margins between classes (even in small datasets). | Image classification, face detection, bioinformatics. |
| K-Nearest Neighbors (KNN) | Both | Classifies based on the majority class among the K nearest neighbors. | You want simplicity and don't need a model explanation (great for cold-start problems). | Recommender systems, customer segmentation, personalization. |
| Naive Bayes | Classification | Applies Bayes' Theorem assuming independence between features. | You have categorical data and want fast, baseline classification. | Text classification, email filtering, sentiment analysis. |
| Random Forest | Both | Builds multiple decision trees and averages their predictions (ensemble learning). | You need high performance, non-linear classification/regression with feature importance. | Credit scoring, e-commerce ranking, fraud detection. |
| Gradient Boosting (GBM) | Both | Builds trees sequentially where each tree corrects the errors of the previous ones. | You want top performance and can handle slower training time and hyperparameter tuning. | Click-through rate prediction, stock trend prediction, customer lifetime value estimation. |

When to Use Which Algorithm?

  1. Regression Problems:
    • Use Linear Regression for interpretable, linear relationships.
    • Use Random Forest/GBM for non-linear, high-accuracy needs.
  2. Classification Problems:
    • Use Logistic Regression for binary/multi-class linear problems.
    • Use SVM for small-to-medium datasets with clear margins.
    • Use Naive Bayes for text/NLP tasks (fast but simplistic).
    • Use Random Forest/GBM for tabular data with complex patterns.
  3. Modern Trends:
    • GBM variants (XGBoost, LightGBM, CatBoost) dominate structured data competitions.
    • Hybrid models (e.g., RF + SVM) are used in healthcare/biology.
    • Deep Learning (CNNs/RNNs) replaces traditional methods for image/text/sequential data.
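
In practice, the quickest way to choose between candidates is often to cross-validate a few of them on your own data. The sketch below does this for three of the algorithms discussed, using a synthetic dataset from make_classification as a stand-in. The dataset, the model settings, and the 5-fold setup are all assumptions for illustration, and the ranking it prints says nothing about how these models compare on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic tabular data standing in for a real classification problem.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

models = {
    # Logistic regression benefits from scaled inputs, hence the pipeline.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each candidate.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name:20s} mean accuracy: {scores.mean():.3f}")
```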
