[Day 21] Unsupervised Machine Learning Type 4 - Principal Component Analysis (PCA) (with a Small Python Project)
Too many features? PCA squeezes your data into 2 smart axes—keeping the patterns, ditching the noise. 💡📉
🎯 What is PCA?
PCA (Principal Component Analysis) is a technique used to reduce the number of features in your data without losing the important information.
Think of it like:
“I have 8 columns of data. Can I shrink it to 2 or 3 powerful ones that explain most of what’s happening?”
That’s exactly what PCA does:
- It compresses your data smartly.
- Keeps the patterns.
- Helps in faster processing, better visualization, and reducing noise.
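Here's a quick taste in code (a minimal sketch using scikit-learn on made-up data, just to show the shape change):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 8)                  # made-up data: 100 rows, 8 features
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                           # (100, 2): same rows, far fewer columns
```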
🌍 Why is PCA Useful in Real Life?
✅ When your dataset has too many columns (a.k.a. high-dimensional).
✅ When features are correlated or repetitive.
✅ When you want to visualize complex data in 2D or 3D.
🔍 Real-World Examples
🏦 1. Fraud Detection (Banking)
- 8+ transaction attributes like amount, device score, time, etc.
- PCA compresses them into 2 key components.
- Helps visualize and flag abnormal behavior.
🧬 2. Healthcare - Disease Pattern Recognition
- Thousands of gene expressions for patients.
- PCA helps extract the 2–3 most significant patterns.
- Aids in clustering or diagnosis.
🛍️ 3. Customer Behavior Analysis
- Retailers track browsing, purchases, app activity.
- PCA simplifies the customer profile while retaining behavioral signals.
🖼️ 4. Image Compression / Face Recognition
- An image has thousands of pixels.
- PCA converts them into a few ‘Eigenfaces’ — compressed faces that still carry identity!
🔧 How PCA Works (Step-by-Step)
Step 1: Standardize the Data
👉 Features must be on the same scale.
Step 2: Find Directions of Maximum Variance
👉 PCA finds the axes (principal components) where data varies the most.
Step 3: Project the Data onto New Axes
👉 The data is re-expressed using fewer features.
🎯 Result: Smaller data, same meaning. Ready for clustering, anomaly detection, or modeling.
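If you want to see the math behind those three steps, here's a minimal from-scratch sketch in NumPy (toy data and my own variable names, not a production implementation; component signs may differ from scikit-learn's):

```python
import numpy as np

X = np.random.rand(35, 8)              # toy data: 35 rows, 8 features

# Step 1: Standardize each feature to mean 0 and std 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: Directions of maximum variance = eigenvectors of the covariance matrix
cov = np.cov(X_std, rowvar=False)      # 8x8 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov) # eigh handles symmetric matrices
order = np.argsort(eigvals)[::-1]      # sort directions by variance, largest first
top2 = eigvecs[:, order[:2]]           # keep the 2 strongest directions

# Step 3: Project the data onto the new axes
X_pca = X_std @ top2                   # shape: (35, 2)
print(X_pca.shape)
```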
🖥 Python
Mini Project: SmartSqueeze - PCA-Based Financial Data Compression
Imagine you're a data analyst at a fintech company. You’ve received transaction logs containing 8 features per transaction, such as:
- Amount
- Hour of day
- Device Score
- Whether the user has previously committed fraud
- Risk score of the merchant
- Customer tenure
- Number of transactions in the last 30 days
- Location variance
That’s a lot of dimensions to look at — especially if you want to detect patterns or visualize user behavior.
➡️ So you use PCA to reduce it to just 2 powerful features (PCA1 & PCA2).
Dataset:
📊 Enhanced Transaction Dataset (35 Rows × 8 Features)
Transaction ID | Amount ($) | Hour | Device Score | Prev Fraud (0/1) | Location Variance | Merchant Risk Score | Customer Tenure (Years) | Num Transactions (30d) |
---|---|---|---|---|---|---|---|---|
1 | 1252 | 0 | 0.80 | 0 | 0.43 | 0.03 | 7 | 3 |
2 | 4866 | 12 | 0.64 | 0 | 1.88 | 0.88 | 4 | 2 |
3 | 4150 | 8 | 0.67 | 1 | 0.50 | 0.88 | 4 | 27 |
4 | 1862 | 5 | 0.72 | 1 | 1.33 | 0.61 | 4 | 14 |
5 | 5291 | 0 | 0.79 | 0 | 1.14 | 0.72 | 2 | 3 |
6 | 3624 | 1 | 0.42 | 0 | 1.68 | 0.42 | 5 | 27 |
7 | 3439 | 14 | 0.69 | 0 | 1.64 | 0.95 | 7 | 45 |
8 | 2388 | 2 | 0.44 | 0 | 1.90 | 0.76 | 1 | 20 |
9 | 1363 | 7 | 0.59 | 0 | 1.58 | 0.69 | 5 | 33 |
10 | 1940 | 2 | 0.57 | 1 | 1.23 | 0.83 | 5 | 12 |
11 | 4075 | 20 | 0.52 | 1 | 0.88 | 0.36 | 6 | 45 |
12 | 5682 | 20 | 0.65 | 1 | 0.95 | 0.73 | 1 | 1 |
13 | 1898 | 21 | 0.90 | 0 | 1.10 | 0.41 | 3 | 18 |
14 | 2251 | 15 | 0.45 | 0 | 1.30 | 0.98 | 4 | 37 |
15 | 3167 | 7 | 0.62 | 0 | 1.36 | 0.51 | 7 | 15 |
16 | 1194 | 0 | 0.42 | 0 | 0.99 | 0.96 | 0 | 20 |
17 | 1053 | 7 | 0.88 | 0 | 0.44 | 0.46 | 6 | 8 |
18 | 2901 | 5 | 0.60 | 1 | 0.92 | 0.34 | 3 | 41 |
19 | 2776 | 9 | 0.74 | 1 | 0.66 | 0.93 | 8 | 26 |
20 | 1115 | 3 | 0.79 | 0 | 0.21 | 0.91 | 6 | 1 |
21 | 2912 | 4 | 0.97 | 0 | 1.09 | 0.21 | 6 | 18 |
22 | 4438 | 9 | 0.60 | 0 | 1.63 | 0.51 | 4 | 17 |
23 | 4333 | 3 | 0.72 | 0 | 1.17 | 0.89 | 4 | 3 |
24 | 2874 | 2 | 0.72 | 0 | 0.51 | 0.20 | 0 | 12 |
25 | 1454 | 5 | 0.80 | 0 | 1.85 | 0.46 | 3 | 47 |
26 | 4688 | 1 | 0.33 | 0 | 0.73 | 0.52 | 7 | 10 |
27 | 4866 | 14 | 0.49 | 0 | 1.66 | 0.52 | 4 | 36 |
28 | 2795 | 7 | 0.79 | 0 | 0.62 | 0.90 | 5 | 7 |
29 | 2063 | 19 | 0.35 | 1 | 1.81 | 0.74 | 4 | 8 |
30 | 3517 | 22 | 0.26 | 0 | 1.88 | 0.34 | 3 | 44 |
31 | 4314 | 2 | 0.95 | 0 | 0.36 | 0.10 | 4 | 15 |
32 | 2545 | 4 | 0.63 | 0 | 0.29 | 0.93 | 6 | 35 |
33 | 1720 | 23 | 0.90 | 0 | 0.17 | 0.29 | 0 | 48 |
34 | 4646 | 6 | 0.38 | 1 | 0.97 | 0.29 | 4 | 16 |
35 | 1684 | 0 | 0.56 | 1 | 0.53 | 0.92 | 6 | 18 |
You can save it as enhanced_transaction_data.csv
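If you'd rather build the file in code, here's a sketch (only the first three rows are filled in; copy the rest from the table above):

```python
import pandas as pd

columns = ['Transaction ID', 'Amount ($)', 'Hour', 'Device Score',
           'Prev Fraud (0/1)', 'Location Variance', 'Merchant Risk Score',
           'Customer Tenure (Years)', 'Num Transactions (30d)']
rows = [
    [1, 1252, 0, 0.80, 0, 0.43, 0.03, 7, 3],
    [2, 4866, 12, 0.64, 0, 1.88, 0.88, 4, 2],
    [3, 4150, 8, 0.67, 1, 0.50, 0.88, 4, 27],
    # ... fill in the remaining 32 rows from the table above
]
pd.DataFrame(rows, columns=columns).to_csv("enhanced_transaction_data.csv", index=False)
```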
📝 Python Code:
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: Load the data
data = pd.read_csv("enhanced_transaction_data.csv")

# Step 2: Select features for PCA
features = ['Amount ($)', 'Hour', 'Device Score', 'Prev Fraud (0/1)',
            'Location Variance', 'Merchant Risk Score',
            'Customer Tenure (Years)', 'Num Transactions (30d)']
X = data[features]

# Step 3: Standardize the features (mean 0, std 1) so no single scale dominates
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 4: Apply PCA to reduce 8 features to 2 components
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
data['PCA1'] = components[:, 0]
data['PCA2'] = components[:, 1]

# Step 5: Visualize the PCA projection, colored by the fraud label
plt.figure(figsize=(10, 6))
sns.scatterplot(
    x='PCA1',
    y='PCA2',
    data=data,
    hue='Prev Fraud (0/1)',
    palette='coolwarm',
    s=100,
    edgecolor='black'
)
plt.title('PCA Projection of Transaction Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.legend(title='Previous Fraud')
plt.tight_layout()
plt.show()
```
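Before reading too much into the 2-D picture, it's worth checking how much of the original variance the two components actually keep. A quick check, continuing from the code above (the exact numbers depend on your data):

```python
# Fraction of total variance captured by each component
print(pca.explained_variance_ratio_)
# Total variance retained by PCA1 + PCA2 together
print(pca.explained_variance_ratio_.sum())
```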
Result:
(Figure: the 35 transactions plotted in PCA1–PCA2 space, colored by the Previous Fraud label.)
🔬 What Do PCA1 and PCA2 Represent?
- PCA1 (Principal Component 1): This new feature captures the maximum variation in the data. It’s a combination of the original 8 features.
- PCA2 (Principal Component 2): This captures the second most important direction of variation, uncorrelated with PCA1.
Together, PCA1 and PCA2 give you a compressed view of your data — 2 dimensions that explain the most meaningful behavior across all 8 features.
They are not just one feature like "Amount" or "Hour"; each is a weighted combination of all 8 features, where the weights (called loadings) reflect how much each feature contributes to that direction of variation. You can inspect these weights in code, as shown below.
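Continuing from the project code above, a small sketch to peek at those weights:

```python
import pandas as pd

# Each row shows how strongly each original feature pulls on that component
loadings = pd.DataFrame(pca.components_, columns=features, index=['PCA1', 'PCA2'])
print(loadings.round(2))
```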
🔎 What the PCA Graph Tells You:
- Each point = a transaction
- Position in the PCA1–PCA2 space = summary of all 8 features
- Color = fraud label (0 = No fraud, 1 = Fraud)
✅ Insights You Can Get:
- Fraudulent transactions may appear as separate clusters or outliers.
- Transactions with similar patterns are grouped together.
- You can zoom in on suspicious zones or investigate clusters.
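One simple way to "zoom in on suspicious zones" is to rank transactions by how far they sit from the center of the PCA plot (a hedged sketch; since PCA output is centered, the origin is the average transaction, and the cutoff for "far" is up to you):

```python
import numpy as np

# Distance from the center of the PCA plot; far-away points deserve a closer look
data['pca_distance'] = np.sqrt(data['PCA1']**2 + data['PCA2']**2)
print(data.sort_values('pca_distance', ascending=False)
          [['Transaction ID', 'pca_distance']].head(5))
```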
In a Nutshell
- PCA1 & PCA2 are not original features, but powerful summaries of your data.
- PCA helps you visualize high-dimensional data and find meaningful patterns.
- It’s a useful preprocessing step before clustering (like DBSCAN) or anomaly detection; a quick sketch follows below.
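For example, running DBSCAN on the two components (a sketch; eps and min_samples are guesses that would need tuning on real data):

```python
from sklearn.cluster import DBSCAN

# Cluster in the compressed 2-D space; label -1 marks outliers (potential anomalies)
data['cluster'] = DBSCAN(eps=1.0, min_samples=3).fit_predict(components)
print(data['cluster'].value_counts())
```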
💬 Join the DecodeAI WhatsApp Channel for regular AI updates → Click here