
Core Machine Learning Concepts Part 5 - Word Embeddings - How AI Understands the Meaning of Words

AI's secret sauce for understanding language—turning words into powerful numerical vectors that capture meaning, context, and even creativity


Introduction: The Language Processing Pipeline

For AI to understand human language, it must first break down text into manageable pieces and then represent those pieces numerically. This process involves two crucial steps: tokenization (splitting text into units) and word embeddings (representing those units mathematically). Together, they form the foundation of modern NLP.

Human language is complex. Words can have multiple meanings (e.g., "bank" = financial institution or riverbank), and relationships between words (e.g., "king" → "queen") are not obvious to machines. Traditional AI systems treated words as isolated symbols, leading to poor understanding. Word embeddings changed everything by representing words as numerical vectors that capture meaning, context, and relationships.


1. Tokenization: The First Step in NLP

What is Tokenization?

Tokenization is the process of splitting text into smaller units called tokens, which can be:

  • Words ("cat", "running")
  • Subwords ("un", "happy")
  • Characters ("c", "a", "t")
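To get a concrete feel for these three granularities, here is a minimal sketch in plain Python (the subword split is hand-made for illustration; real subword tokenizers such as BPE learn their splits from data):

```python
# Plain-Python illustration of the three token granularities (no NLP library needed).
text = "unhappiness is temporary"

word_tokens = text.split()                                   # word-level
subword_tokens = ["un", "happi", "ness", "is", "temporary"]  # subword-level (illustrative split)
char_tokens = list(text.replace(" ", ""))                    # character-level

print(word_tokens)      # ['unhappiness', 'is', 'temporary']
print(subword_tokens)   # ['un', 'happi', 'ness', 'is', 'temporary']
print(char_tokens[:6])  # ['u', 'n', 'h', 'a', 'p', 'p']
```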

Why Tokenization Matters

  • Converts raw text into processable units
  • Handles punctuation and special characters
  • Prepares text for embedding models

The Embedding Process

  1. Tokenize the input text
  2. Convert tokens to numerical IDs
  3. Map IDs to embedding vectors
  4. Process through neural networks

Try it yourself at: https://platform.openai.com/tokenizer

Example: ‘My Name is Akshay Seth’ has 6 tokens.
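You can reproduce this in code with OpenAI's open-source tiktoken library. The sketch below is minimal: the exact split and count depend on which encoding you pick ("cl100k_base" here is an assumption), and the toy embedding table stands in for the real one a model learns during training.

```python
# Minimal sketch with tiktoken (pip install tiktoken); the chosen encoding is an assumption.
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "My Name is Akshay Seth"
token_ids = enc.encode(text)                   # step 2: tokens -> numerical IDs
tokens = [enc.decode([i]) for i in token_ids]  # which piece of text each ID covers

print(tokens)                                  # e.g. ['My', ' Name', ' is', ' Ak', 'shay', ' Seth']
print(token_ids, "->", len(token_ids), "tokens")

# Step 3 (toy version): map IDs to vectors via an embedding table lookup.
embedding_table = np.random.rand(enc.n_vocab, 8)  # real models learn this table during training
vectors = embedding_table[token_ids]              # shape: (number_of_tokens, 8)
print(vectors.shape)
```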

A helpful rule of thumb is that one token generally corresponds to ~4 characters of common English text. This translates to roughly ¾ of a word (so 100 tokens ≈ 75 words).
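As a quick sanity check of that rule of thumb, here is a tiny, purely heuristic estimator:

```python
# Heuristic only: 1 token ~ 4 characters, so 100 tokens ~ 400 characters ~ 75 words.
def estimate_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

sample = "Word embeddings turn words into vectors that capture meaning and context."
print(len(sample), "characters ~", estimate_tokens(sample), "tokens,", len(sample.split()), "words")
```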

2. Understanding Word Embeddings

What Are Word Embeddings?

Numerical representations of words in high-dimensional space (usually 50-1000 dimensions) where:

  • Similar words are close together
  • Opposite words are far apart
  • Relationships are preserved (e.g., king - man + woman ≈ queen)

Example: Word2Vec Embeddings

Word | Vector (Simplified)
King | [0.5, -0.2, 0.7]
Queen | [0.48, -0.19, 0.69]
Apple | [-0.3, 0.8, 0.1]
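Using those simplified 3-dimensional vectors, a quick NumPy sketch shows how "closeness" is actually measured. Real embeddings have 50-1000 dimensions, but the cosine-similarity idea is identical:

```python
# Cosine similarity on the toy vectors from the table above.
import numpy as np

vectors = {
    "king":  np.array([0.5, -0.2, 0.7]),
    "queen": np.array([0.48, -0.19, 0.69]),
    "apple": np.array([-0.3, 0.8, 0.1]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["queen"]))  # ~1.0: very similar
print(cosine_similarity(vectors["king"], vectors["apple"]))  # negative: unrelated, far apart
```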

In the same tokenization example, OpenAI likewise maps each token of ‘My name is Akshay Seth’ to its own embedding vector.

"Ever wondered how AI sees words? Let's explore the hidden geometry of language in multidimensional space!"

The TensorFlow Embedding Projector reveals how words transform into mathematical vectors stored in vector databases. Here's your quick guide:

  1. Open the portal: Go to projector.tensorflow.org
  2. Load embeddings: Try "Word2Vec All" (10K English words)
  3. Navigate the space:
    • Rotate: click + drag
    • Zoom: mouse wheel
    • Find words: search box
  4. See relationships:
    • "king" → "queen" (similar)
    • "hot" → "cold" (opposites)
  5. Change views: Switch between PCA/t-SNE projections
  6. Database insight: Each point is a word vector stored in vector DBs like Pinecone

Pro tip: Upload your own embeddings (save as TSV) to visualize custom datasets!
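If you want to try that, here is a rough sketch of the two TSV files the projector expects: a vectors file and an optional metadata file with one label per row. The toy vectors are just the ones from the table earlier.

```python
# Sketch: write embeddings in the TSV layout the Embedding Projector can load.
import numpy as np

words = ["king", "queen", "apple"]
vectors = np.array([
    [0.5, -0.2, 0.7],
    [0.48, -0.19, 0.69],
    [-0.3, 0.8, 0.1],
])

np.savetxt("vectors.tsv", vectors, delimiter="\t")  # upload under "Load" as the vectors file
with open("metadata.tsv", "w") as f:
    f.write("\n".join(words))                       # upload as metadata (point labels)
```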

"Words become coordinates in AI's conceptual universe!" 🌌

3. How Word Embeddings Are Created

Method 1: Word2Vec (2013)

  • Skip-gram: Predicts surrounding words from target
  • CBOW: Predicts target word from context
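Here is a minimal training sketch using the gensim library (one popular Word2Vec implementation, pip install gensim). The three-sentence corpus is made up, so the resulting vectors are illustrative only:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is already tokenized into a list of words.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "apple", "fell", "from", "the", "tree"],
]

# sg=1 -> Skip-gram (predict context from target); sg=0 -> CBOW (predict target from context)
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["king"][:5])           # first 5 dimensions of the learned vector
print(model.wv.most_similar("king"))  # nearest neighbours in this tiny toy space
```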

Method 2: GloVe (2014)

Uses global word co-occurrence statistics
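GloVe's actual training uses a weighted least-squares objective, but its raw input is exactly that global co-occurrence matrix. A minimal sketch of counting co-occurrences with a one-word window:

```python
from collections import Counter

corpus = ["the king rules the kingdom", "the queen rules the kingdom"]
window = 1                      # count neighbours within +/- 1 word
cooccurrence = Counter()

for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooccurrence[(word, words[j])] += 1

print(cooccurrence[("king", "rules")])   # how often "king" appears next to "rules"
print(cooccurrence[("the", "kingdom")])
```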

Method 3: Contextual Embeddings (BERT, GPT)

Generates different embeddings based on context
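Below is a hedged sketch using Hugging Face's transformers library (my choice of library and model, not something specified above): the same word "bank" gets a different vector in each sentence because the model reads the surrounding context.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited cash at the bank.",
             "We had a picnic on the river bank."]

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]   # one 768-dim vector per token
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        bank_vector = hidden[tokens.index("bank")]
        print(sentence, "->", bank_vector[:4])          # first 4 dims differ per sentence
```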

