
Core Machine Learning Concepts Part 5 - Word Embeddings - How AI Understands the Meaning of Words

AI's secret sauce for understanding language—turning words into powerful numerical vectors that capture meaning, context, and even creativity


Introduction: The Language Processing Pipeline

For AI to understand human language, it must first break down text into manageable pieces and then represent those pieces numerically. This process involves two crucial steps: tokenization (splitting text into units) and word embeddings (representing those units mathematically). Together, they form the foundation of modern NLP.

Human language is complex. Words can have multiple meanings (e.g., "bank" = financial institution or riverbank), and relationships between words (e.g., "king" → "queen") are not obvious to machines. Traditional AI systems treated words as isolated symbols, leading to poor understanding. Word embeddings changed everything by representing words as numerical vectors that capture meaning, context, and relationships.


1. Tokenization: The First Step in NLP

What is Tokenization?

Tokenization is the process of splitting text into smaller units called tokens, which can be:

  • Words ("cat", "running")
  • Subwords ("un", "happy")
  • Characters ("c", "a", "t")
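To get a concrete feel for these three granularities, here is a minimal sketch in plain Python (the subword split is hand-made for illustration; real subword tokenizers such as BPE learn their splits from data):

```python
# Plain-Python illustration of the three token granularities (no NLP library needed).
text = "unhappiness is temporary"

word_tokens = text.split()                                   # word-level
subword_tokens = ["un", "happi", "ness", "is", "temporary"]  # subword-level (illustrative split)
char_tokens = list(text.replace(" ", ""))                    # character-level

print(word_tokens)      # ['unhappiness', 'is', 'temporary']
print(subword_tokens)   # ['un', 'happi', 'ness', 'is', 'temporary']
print(char_tokens[:6])  # ['u', 'n', 'h', 'a', 'p', 'p']
```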

Why Tokenization Matters

  • Converts raw text into processable units
  • Handles punctuation and special characters
  • Prepares text for embedding models

The Embedding Process

  1. Tokenize the input text
  2. Convert tokens to numerical IDs
  3. Map IDs to embedding vectors
  4. Process through neural networks

Try it yourself at: https://platform.openai.com/tokenizer

Example: ‘My Name is Akshay Seth’ has 6 tokens.
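You can reproduce this in code with OpenAI's open-source tiktoken library. The sketch below is minimal: the exact split and count depend on which encoding you pick ("cl100k_base" here is an assumption), and the toy embedding table stands in for the real one a model learns during training.

```python
# Minimal sketch with tiktoken (pip install tiktoken); the chosen encoding is an assumption.
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "My Name is Akshay Seth"
token_ids = enc.encode(text)                   # step 2: tokens -> numerical IDs
tokens = [enc.decode([i]) for i in token_ids]  # which piece of text each ID covers

print(tokens)                                  # e.g. ['My', ' Name', ' is', ' Ak', 'shay', ' Seth']
print(token_ids, "->", len(token_ids), "tokens")

# Step 3 (toy version): map IDs to vectors via an embedding table lookup.
embedding_table = np.random.rand(enc.n_vocab, 8)  # real models learn this table during training
vectors = embedding_table[token_ids]              # shape: (number_of_tokens, 8)
print(vectors.shape)
```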

A helpful rule of thumb is that one token generally corresponds to ~4 characters of common English text. This translates to roughly ¾ of a word (so 100 tokens ≈ 75 words).
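As a quick sanity check of that rule of thumb, here is a tiny, purely heuristic estimator:

```python
# Heuristic only: 1 token ~ 4 characters, so 100 tokens ~ 400 characters ~ 75 words.
def estimate_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

sample = "Word embeddings turn words into vectors that capture meaning and context."
print(len(sample), "characters ~", estimate_tokens(sample), "tokens,", len(sample.split()), "words")
```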

2. Understanding Word Embeddings

What Are Word Embeddings?

Numerical representations of words in high-dimensional space (usually 50-1000 dimensions) where:

  • Similar words are close together
  • Opposite words are far apart
  • Relationships are preserved (e.g., king - man + woman ≈ queen)

Example: Word2Vec Embeddings

Word | Vector (Simplified)
King | [0.5, -0.2, 0.7]
Queen | [0.48, -0.19, 0.69]
Apple | [-0.3, 0.8, 0.1]
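Using those simplified 3-dimensional vectors, a quick NumPy sketch shows how "closeness" is actually measured. Real embeddings have 50-1000 dimensions, but the cosine-similarity idea is identical:

```python
# Cosine similarity on the toy vectors from the table above.
import numpy as np

vectors = {
    "king":  np.array([0.5, -0.2, 0.7]),
    "queen": np.array([0.48, -0.19, 0.69]),
    "apple": np.array([-0.3, 0.8, 0.1]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["queen"]))  # ~1.0: very similar
print(cosine_similarity(vectors["king"], vectors["apple"]))  # negative: unrelated, far apart
```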

In the same tokenization example, OpenAI likewise maps each token of ‘My name is Akshay Seth’ to its own embedding vector.

"Ever wondered how AI sees words? Let's explore the hidden geometry of language in multidimensional space!"

The TensorFlow Embedding Projector reveals how words transform into mathematical vectors stored in vector databases. Here's your quick guide:

  1. Open the portal: Go to projector.tensorflow.org
  2. Load embeddings: Try "Word2Vec All" (10K English words)
  3. Navigate the space:
    • Rotate: click + drag
    • Zoom: mouse wheel
    • Find words: search box
  4. See relationships:
    • "king" → "queen" (similar)
    • "hot" → "cold" (opposites)
  5. Change views: Switch between PCA/t-SNE projections
  6. Database insight: Each point is a word vector stored in vector DBs like Pinecone

Pro tip: Upload your own embeddings (save as TSV) to visualize custom datasets!
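If you want to try that, here is a rough sketch of the two TSV files the projector expects: a vectors file and an optional metadata file with one label per row. The toy vectors are just the ones from the table earlier.

```python
# Sketch: write embeddings in the TSV layout the Embedding Projector can load.
import numpy as np

words = ["king", "queen", "apple"]
vectors = np.array([
    [0.5, -0.2, 0.7],
    [0.48, -0.19, 0.69],
    [-0.3, 0.8, 0.1],
])

np.savetxt("vectors.tsv", vectors, delimiter="\t")  # upload under "Load" as the vectors file
with open("metadata.tsv", "w") as f:
    f.write("\n".join(words))                       # upload as metadata (point labels)
```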

"Words become coordinates in AI's conceptual universe!" 🌌

3. How Word Embeddings Are Created

Method 1: Word2Vec (2013)

  • Skip-gram: Predicts surrounding words from target
  • CBOW: Predicts target word from context
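Here is a minimal training sketch using the gensim library (one popular Word2Vec implementation, pip install gensim). The three-sentence corpus is made up, so the resulting vectors are illustrative only:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is already tokenized into a list of words.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "apple", "fell", "from", "the", "tree"],
]

# sg=1 -> Skip-gram (predict context from target); sg=0 -> CBOW (predict target from context)
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["king"][:5])           # first 5 dimensions of the learned vector
print(model.wv.most_similar("king"))  # nearest neighbours in this tiny toy space
```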

Method 2: GloVe (2014)

Uses global word co-occurrence statistics
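GloVe's actual training uses a weighted least-squares objective, but its raw input is exactly that global co-occurrence matrix. A minimal sketch of counting co-occurrences with a one-word window:

```python
from collections import Counter

corpus = ["the king rules the kingdom", "the queen rules the kingdom"]
window = 1                      # count neighbours within +/- 1 word
cooccurrence = Counter()

for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooccurrence[(word, words[j])] += 1

print(cooccurrence[("king", "rules")])   # how often "king" appears next to "rules"
print(cooccurrence[("the", "kingdom")])
```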

Method 3: Contextual Embeddings (BERT, GPT)

Generates different embeddings based on context
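Below is a hedged sketch using Hugging Face's transformers library (my choice of library and model, not something specified above): the same word "bank" gets a different vector in each sentence because the model reads the surrounding context.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited cash at the bank.",
             "We had a picnic on the river bank."]

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]   # one 768-dim vector per token
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        bank_vector = hidden[tokens.index("bank")]
        print(sentence, "->", bank_vector[:4])          # first 4 dims differ per sentence
```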

