πŸ’¬ Join the DecodeAI WhatsApp Channel for more AI updates β†’ Click here

'AI Engineering' Summary

Building Applications with Foundation Models.


In 2025:

> Enterprises spent $37 billion on generative AI aloneβ€”a staggering 3.2x increase from 2024.

> 88% of organizations now use AI in at least one business function (McKinsey Global Survey 2025).

> Globally, private AI investment hit record highs, with generative AI attracting billions in funding and roughly one in six people worldwide now using gen-AI tools. The global AI software market is projected to reach hundreds of billions of dollars by the end of the decade, driven by foundation models whose frontier training runs now cost hundreds of millions of dollars.

This explosive growth marks the era of AI Engineeringβ€”a paradigm shift from traditional machine learning's model-centric focus to building robust, production-grade systems around powerful off-the-shelf foundation models. Chip Huyen's AI Engineering captures this transformation perfectly. Here's a comprehensive, standalone deep-dive that distills every major concept, pattern, and strategy from the bookβ€”no need to read the original to master modern AI building.


Chapter 1: The Shift to AI Engineering and the Power of Scale

Traditional ML engineering revolved around training custom models from scratch on labeled dataβ€”think collecting thousands of resumes to train a named entity recognizer with spaCy.

Today, AI engineering leverages foundation models (like GPT-4, Llama, or Claude) via APIs or open weights. The focus shifts to data-and-context-centric design: prompting, retrieval, orchestration, and reliability engineering for non-deterministic outputs.

(Image: state of play of AI progress and brakes on an intelligence explosion. Source: https://www.interconnects.ai/p/brakes-on-an-intelligence-explosion)

Scale defines everything. Larger models (now in the trillions of parameters for frontier ones) unlock emergent capabilitiesβ€”abilities like code generation that arise unexpectedly from next-token prediction.

The modern AI stack includes:

  • Infrastructure (GPUs, serving)
  • Model development (finetuning, quantization): Mix proprietary (Claude 4.6 Opus, Gemini 3.1 Pro) + open (Llama 4 Maverick, DeepSeek V3.2, Mistral 3).
  • Application layer: Prompting → Advanced RAG → Multi-agents → Feedback flywheel + UI.

Real Production Example: DoorDash's dasher support chatbot (deployed 2025–2026) β€” in-house RAG + LLM guardrails + LLM-as-judge β†’ automates responses, slashes resolution time. They route simple queries to cheaper models, complex to reasoning-heavy ones like Claude.

Chapter 2: How Foundation Models Really Work

LLMs process tokensβ€”subword units from text turned into integers via tokenization.

(Image: what is a tokenizer? – Bilel KLED. Source: https://kbilel.com/2024/12/03/road-to-llm-what-is-a-tokenizer-day-3/)
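
A quick way to see tokenization in practice, using the tiktoken library (a minimal sketch; the encoding name is an assumption, since different models ship different tokenizers):

Python

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")      # tokenizer used by several OpenAI-era models
text = "AI engineering builds on foundation models."
ids = enc.encode(text)                          # subword units -> integers
print(ids)                                      # list of token IDs
print([enc.decode([i]) for i in ids])           # the individual subword pieces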

AI Model Training occurs in phases:

  1. Pre-training (the "Shoggoth"): Massive unsupervised next-token prediction on internet data β†’ learns patterns but no alignment.
  2. Supervised Finetuning (SFT): Instruction-response pairs teach assistant behavior.
  3. Preference Alignment (RLHF/DPO): Rankings make outputs helpful, harmless, honest.

Full transformer block flow with token embeddings + attention + output logits:

(Image: journey of a single token through the LLM architecture – Mayank Pratap Singh. Source: https://blogs.mayankpratapsingh.in/blog/journey-of-a-single-token-through-the-LLM-Architecture)


2026 Top Models:

  • Claude 4.5 Sonnet β€” Ethical, hybrid reasoning, 1M token context.
  • Gemini 2.5 Pro β€” Multimodal (text+video+audio), sparse MoE, "thinking model".
  • Grok 4.1 β€” Unfiltered, real-time knowledge, strong math/coding.
  • Llama 4 β€” Open, MoE architecture (Scout/Maverick variants).
  • Mistral Large 3 β€” Portable, efficient on edge.



Chapters 3–4: Evaluation – From Vibes to Production Reliability

Evaluating AI systems (especially large language models like GPT, Claude, Gemini, or Llama in 2026) is much harder than evaluating traditional apps or old-school ML models. Here's a simple, step-by-step explanation of why it's tricky and how people actually do it in the real world.

Why Is Evaluating AI So Hard?

  1. No single "correct" answer
    • Old ML: "Is this photo a cat?" β†’ Yes/No β†’ easy accuracy score.
    • Modern AI: "Summarize this 10-page report" or "Explain quantum computing like I'm 12" β†’ many good ways to answer. No exact match possible. It's open-ended β†’ we call this "vibes-based" if we just eyeball it.
  2. Models cheat by memorizing (data contamination)
    • Many popular tests (like MMLU = multiple-choice questions about school subjects) are all over the internet.
    • Big models read almost the entire internet during training β†’ they sometimes just remember the answers instead of truly understanding. β†’ Scores look amazing (90%+), but it's fake intelligence β€” like a student who memorized the exam questions.
  3. Hallucinations hide easily
    • The AI can confidently say wrong things that sound right. Hard to catch automatically.

How Do Real Teams Evaluate AI in 2026?

Method 1: Exact Match (only for simple stuff)

  • Works when there's one right answer: math problems, multiple-choice, code that passes tests.
  • Example: GPQA Diamond benchmark β†’ super-hard science questions humans get ~30–40% right. Gemini models now score ~94% β†’ real progress (not just memorization).

Method 2: AI as a Judge (most popular today)

  • Use a strong model (like Claude 4.6 or GPT-5 level) to grade weaker ones.
  • You give it a clear rubric (scoring guide):
    • Is it helpful? (1–5)
    • Is it faithful / no hallucination? (Did it stick to facts or make stuff up?)
    • Is it polite / safe?
  • Super simple prompt example: "Here is the user's question, the provided facts, and the AI's answer. Rate faithfulness 1–10. Only use information from the facts."
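
A minimal LLM-as-judge sketch built around that rubric (the `call_judge` wrapper and the integer-parsing step are assumptions; any strong judge model would slot in):

Python

import re

JUDGE_PROMPT = """Here is the user's question, the provided facts, and the AI's answer.
Rate faithfulness 1-10. Only use information from the facts. Reply with a single integer.

Question: {question}
Facts: {facts}
Answer: {answer}"""

def judge_faithfulness(question: str, facts: str, answer: str, call_judge) -> int:
    """call_judge: any str -> str wrapper around a strong judge model."""
    reply = call_judge(JUDGE_PROMPT.format(question=question, facts=facts, answer=answer))
    match = re.search(r"\d+", reply)             # pull the first integer out of the reply
    if match is None:
        raise ValueError(f"Judge gave no score: {reply!r}")
    return max(1, min(10, int(match.group())))   # clamp to the 1-10 rubric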

Method 3: Holistic Pipeline Check (for full RAG systems)

When you build a real product (chatbot that searches company docs), check 3 things:

  • Context precision / recall → Did we retrieve the right documents? (e.g., aim for recall@10 > 95%: the top 10 results include the relevant chunks)
  • Faithfulness β†’ Did the answer only use the retrieved info? (no made-up facts)
  • Answer relevance β†’ Did it actually solve the user's question?

Example: LinkedIn customer support chat β†’ They use RAG (search internal knowledge + knowledge graph). β†’ Evaluation β†’ AI judge + some human checks β†’ 28.6% faster ticket resolution. β†’ They track: Did the answer fix the issue? Was it polite? No hallucinations about company policy?



Chapter 5: Prompt Engineering – Production-Grade Mastery (Simple 2026 Explanation)

Prompt engineering is basically how you talk to the AI so it gives you the best, most reliable answers β€” especially in real apps where mistakes cost money or time.

Anatomy of a Good Prompt (Breakdown – Super Simple)

A strong prompt usually has a few parts:

  1. System prompt (the "persona/role") – tells the AI who it is. Example: "You are a senior software engineer with 10+ years of experience in backend APIs. Always respond in valid JSON only."
  2. Few-shot examples (5–10 good ones) – show, don't tell. Give 5–10 real input → perfect output pairs so the AI learns the pattern. This is huge in 2026: zero-shot (no examples) often fails on tricky formats; few-shot jumps accuracy from 60–70% to 95%+.

Advanced Techniques in 2026:

1. Chain-of-Thought (CoT) + Self-Reflection

Tell the AI: "Think step-by-step before answering."

This makes it write out reasoning first β†’ huge boost on math, logic, coding (30–50% better on hard problems with models like Gemini 3.1 Pro or Claude 4.6).
Self-reflection = ask it to check its own answer afterward ("Is this correct? Fix if wrong").

(Image: chain-of-thought, HyDE, and step-back prompting techniques – Mayank Pratap Singh. Source: https://blogs.mayankpratapsingh.in/blog/chain-of-thought-hyde-step-back-techniques)

2. Self-Consistency

Ask the same question 8–16 times (with slight rephrasing or a nonzero temperature).
Take the most common answer (majority vote).

Great for math/reasoning β€” reduces random errors.
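
A tiny self-consistency sketch (the `call_llm` wrapper and the crude answer-extraction step are assumptions; real pipelines parse the final number or option more carefully before voting):

Python

from collections import Counter

def self_consistent_answer(question: str, call_llm, n: int = 8) -> str:
    """call_llm: (prompt, temperature) -> str. Samples n answers and majority-votes."""
    prompt = (f"{question}\nThink step-by-step, then give the final answer "
              f"on the last line as 'Answer: <value>'.")
    finals = []
    for _ in range(n):
        reply = call_llm(prompt, temperature=0.8)               # diverse samples
        finals.append(reply.rsplit("Answer:", 1)[-1].strip())   # extract the last answer line
    return Counter(finals).most_common(1)[0][0]                 # majority vote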

(Image: 5 advanced prompting techniques – NexGen Architects. Source: https://medium.com/@nexgenarch/5-advanced-prompting-techniques-to-ace-chatgpt-ac750aa2e01e)

Real-World Example: Banking API JSON Extraction


Problem: You have messy customer messages like:
"Send 5000 to account 12345678 please" or "Transfer β‚Ή10,000 INR from savings to credit card ending 4567"

You need perfect JSON every time:

JSON

{"action": "transfer", "amount": 10000, "currency": "INR", "from": "savings", "to": "credit_card", "last4": "4567"}


Solutions:

  • Zero-shot (just "Extract to JSON"): 60–75% success, misses formats, hallucinates fields.
  • Few-shot (give 5–10 perfect messy → JSON examples in the prompt): Mistral-7B or Llama-4-8B hits 98%+ even on weird edge cases (typos, mixed languages, abbreviations). Many Indian fintechs (Paytm-like apps, Razorpay internal tools) do exactly this in 2026 – few-shot + delimiters + strict schema validation; a sketch of the pattern follows.
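
A minimal sketch of that few-shot + strict-validation pattern (the `call_llm` helper, the example pairs, and the minimal schema are illustrative assumptions, not the book's code):

Python

import json

# 5-10 curated messy-message -> JSON pairs (two shown here for brevity)
FEW_SHOT = [
    ("Send 5000 to account 12345678 please",
     '{"action": "transfer", "amount": 5000, "currency": "INR", "to_account": "12345678"}'),
    ("Transfer 10,000 INR from savings to credit card ending 4567",
     '{"action": "transfer", "amount": 10000, "currency": "INR", "from": "savings", "to": "credit_card", "last4": "4567"}'),
]

REQUIRED_KEYS = {"action", "amount", "currency"}   # assumed minimal schema

def build_prompt(user_message: str) -> str:
    # Delimiters + examples teach the exact output format.
    parts = ["Extract the banking intent as strict JSON. Output JSON only.\n"]
    for text, gold in FEW_SHOT:
        parts.append(f"### Input\n{text}\n### Output\n{gold}\n")
    parts.append(f"### Input\n{user_message}\n### Output\n")
    return "\n".join(parts)

def extract(user_message: str, call_llm) -> dict:
    """call_llm: any str -> str wrapper (OpenAI, vLLM, Mistral, ...)."""
    raw = call_llm(build_prompt(user_message))
    data = json.loads(raw)                          # fails loudly on invalid JSON
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Schema check failed, missing: {missing}")
    return data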


Chapter 6: RAG & Agents – Grounding Knowledge & Autonomy

RAG and Agents are the two biggest ways companies make AI actually useful and trustworthy in real products in 2026 β€” instead of just hallucinating or being stuck with old knowledge.

What is RAG?

RAG = Retrieval-Augmented Generation

= "Look up real facts first, then answer using them."

Why needed?

  • LLMs like Grok 4.1 or Claude 4.6 have knowledge cutoffs and can confidently make up stuff (hallucinations).
  • RAG fixes this by searching your own documents/company data before answering.

How the 2026 Production RAG Pipeline Works

(Image: a deep dive into Retrieval-Augmented Generation – Anbukkarasu. Source: https://medium.com/@anbukkarasuak/a-deep-dive-into-retrieval-augmented-generation-rag-d7c1e786d661)

  1. Chunking β€” Break big PDFs/docs into small pieces (200–512 tokens). Use semantic chunking (group by meaning) + fixed size.
  2. Embed β€” Turn each chunk into a vector (number array) using good embedders like Mistral embeddings or OpenAI text-embedding-3-large.
  3. Store in Vector DB β€” Save vectors in Pinecone, Weaviate, Chroma, or Elasticsearch.
  4. Hybrid Search β€” When user asks question:
    • Convert question to vector β†’ find similar chunks (semantic search).
    • Also do keyword search (BM25) for exact matches.
    • Combine both.
  5. Reranker β€” Take top 50 results β†’ use a smart reranker (Cohere Rerank or BGE-reranker) to pick the true best 5–10.
  6. Generation β€” Stuff those best chunks into prompt β†’ ask LLM (Claude/Gemini/Mistral) to answer grounded in them only.

Files go in and get vectorized; a query comes in; smart search finds the relevant pieces; the LLM generates a grounded answer.
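
A compact end-to-end sketch of steps 1–6 using Chroma's built-in embedder (the collection name, the toy chunks, and the `call_llm` wrapper are assumptions; a production system would add hybrid BM25 search and a reranker as described above):

Python

import chromadb

client = chromadb.Client()
docs = client.create_collection("company_docs")     # uses Chroma's default embedder

# 1-3: chunk your documents (trivially here) and store the embeddings.
chunks = ["Refunds are processed within 5 business days.",
          "Premium users get 24/7 phone support."]
docs.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

def answer(question: str, call_llm, k: int = 2) -> str:
    # 4-5: semantic search for the top-k chunks (hybrid search and rerank omitted in this sketch).
    hits = docs.query(query_texts=[question], n_results=k)["documents"][0]
    # 6: generate an answer grounded only in the retrieved chunks.
    context = "\n".join(hits)
    return call_llm(f"Answer using ONLY these facts:\n{context}\n\nQuestion: {question}")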

πŸ” Retrieval-Augmented Generation (RAG) with LangChain, ChromaDB, and FAISS  β€” A Complete Guide | by Saubhagya Vishwakarma | Medium
Source: https://medium.com/@saubhagya.vishwakarma113393/retrieval-augmented-generation-rag-with-langchain-chromadb-and-faiss-a-complete-guide-63ad903a237a

Real-World Examples Everyone Uses in 2026

  • Wealth management firms (like Indian ones – Zerodha, Groww internal tools) β†’ Advisor asks: "Best mutual fund for 35-year-old moderate risk?" β†’ RAG pulls latest SEBI regs, fund performance data, client profile β†’ grounded recommendation.
  • Hospital networks / clinical decision support β†’ Doctor queries patient history + latest guidelines. β†’ RAG β†’ 30% fewer misdiagnoses (real stat from 2025–2026 deployments) because it sticks to verified medical knowledge.

What are Agents? (Super Simple)

Agents = AI that can think, use tools, and loop until done β€” instead of one-shot answers.

Most common pattern: ReAct (Reason + Act)

  • Reason (think): "What should I do next?"
  • Act (use tool): Call calculator, search web, check calendar, run code, etc.
  • Observe (see result): Get output from tool.
  • Repeat until good answer β†’ then give final response.

Frameworks people actually use

  • LangGraph β†’ Builds stateful loops/graphs (very popular for complex agents).
  • LlamaIndex β†’ Great for RAG grounding + agents.
  • CrewAI β†’ Easy multi-agent teams (one researches, one writes, one reviews).

Real-World Example:

Travel booking bot
(like MakeMyTrip / Cleartrip internal agent) β†’ User: "Book flight to Goa next weekend." β†’ Agent:

    1. Reason: Need dates, budget?
    2. Act: Check user calendar (tool).
    3. Observe: Free Sat-Sun.
    4. Act: Call flight API.
    5. Observe: Prices.
    6. Ask user: "Morning or evening?" β†’ loop again β†’ finally books.

(Image: implementing the ReAct agentic pattern from scratch. Source: https://www.dailydoseofds.com/ai-agents-crash-course-part-10-with-implementation/)
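
A bare-bones version of that Reason → Act → Observe loop (the `call_llm` wrapper and the stub tools are assumptions for illustration; a real agent would use LangGraph or a similar framework):

Python

import json

def check_calendar(query: str) -> str:
    return "Free Sat-Sun next weekend"                   # stub tool

def search_flights(query: str) -> str:
    return "IndiGo 7am Rs 4,200; Vistara 6pm Rs 5,100"   # stub tool

TOOLS = {"check_calendar": check_calendar, "search_flights": search_flights}

def react_agent(task: str, call_llm, max_steps: int = 5) -> str:
    """call_llm: any str -> str LLM wrapper. The model is expected to reply with
    either {"action": tool_name, "input": ...} or {"final": answer}."""
    transcript = f"Task: {task}\nTools: {list(TOOLS)}\n"
    for _ in range(max_steps):
        step = json.loads(call_llm(transcript + "\nWhat next? Reply in JSON."))   # Reason
        if "final" in step:                                # model decided it is done
            return step["final"]
        observation = TOOLS[step["action"]](step["input"]) # Act
        transcript += f"\nAction: {step['action']} -> {observation}"              # Observe
    return "Stopped after max_steps without a final answer."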

Quick start roadmap:
Start with simple RAG (LlamaIndex + Chroma + Mistral-7B) for your side project/company docs. Add agentic loop (LangGraph) only when you need tools (APIs, search). Test faithfulness hard β€” hallucinations kill trust.

Chapter 7: Finetuning – Efficient Specialization

Finetuning means taking a big pre-trained model (like Llama 4, Mistral 3, or Grok 4.1) and teaching it to be better at your specific job β€” without starting from scratch.

In 2026, almost no one does full finetuning anymore (updating every single parameter) because it's super expensive in GPU memory and time. Instead, we use efficient methods like LoRA and QLoRA β€” tiny changes that give almost the same results but cost 10–100x less.

The Process:

  • Use RAG for new facts/knowledge (e.g., latest company policies).
  • Finetune for style, format, behavior β€” things like:
    • Always answer in strict JSON
    • Speak like a polite Indian customer support agent
    • Follow your company's legal tone
    • Never say certain risky things

Finetuning changes the model's "personality" and reliability on your domain.

The Big Idea: PEFT (Parameter-Efficient Finetuning)

PEFT = Train only a tiny fraction of parameters (<1%) instead of the whole model.

a) LoRA (Low-Rank Adaptation) β€” The king in 2026

  • Freeze the big model (don't touch its billions of weights).
  • Add small "adapter" layers (A and B matrices) to certain spots (like attention/feed-forward layers).
  • Train only those tiny adapters β†’ Ξ”W = A Γ— B (low-rank math).
  • Result: You train ~0.1–1% of params β†’ huge memory & speed win.
  • After training: Merge adapters back into base model (or keep separate for switching tasks).
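
A minimal sketch with the Hugging Face PEFT library (the base model name and the rank/alpha values are placeholder assumptions; tune them for your task):

Python

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.3"     # assumed base model
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Freeze everything, then attach small low-rank A/B adapters to the attention layers.
lora_cfg = LoraConfig(
    r=16,                       # rank of the A/B matrices
    lora_alpha=32,              # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # typically well under 1% of total params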

Here's a clear visual comparing full finetuning vs LoRA β€” full updates everything (huge cost), LoRA adds tiny plugins (fast & cheap):

(Image: LoRA, Low-Rank Adaptation of LLMs explained – DigitalOcean. Source: https://www.digitalocean.com/community/tutorials/lora-low-rank-adaptation-llms-explained)


LoRA math flow (frozen weights + low-rank A/B adapters added during training, merged after):

(Image: understanding LoRA – Vikram Pande. Source: https://medium.com/@vikrampande783/understanding-lora-low-rank-adaptation-563978253d6e)

b) QLoRA = LoRA + 4-bit quantization

  • Quantize base model to 4-bit (NF4/AWQ) β†’ fits 70B model on single A100/H200 or even consumer RTX 4090.
  • Use paged optimizers β†’ no memory spikes during training.
  • Train LoRA adapters on top β†’ same quality as full 16-bit but 70–90% less VRAM.
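
A sketch of loading the base model in 4-bit before attaching the same LoRA adapters (NF4 settings as described above; the model name is again an assumption):

Python

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # extra compression of the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",   # assumed base model
    quantization_config=bnb_cfg,
    device_map="auto",
)
# ...then wrap with get_peft_model(model, lora_cfg) exactly as in the LoRA sketch above.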

Memory comparison (full vs LoRA vs QLoRA) β€” QLoRA wins for low-resource setups:

(Image: fine-tuning LLMs with QLoRA – Dillip Prasad. Source: https://medium.com/@dillipprasad60/qlora-explained-a-deep-dive-into-parametric-efficient-fine-tuning-in-large-language-models-llms-c1a4794b1766)

Data for Finetuning (Quality Beats Quantity)

  • Need instruction format pairs: {"instruction": "Extract invoice details", "input": "Invoice #123...", "output": perfect JSON}
  • 500–5,000 high-quality examples > 50,000 noisy ones (LIMA research still holds).
  • Use synthetic data: Ask Claude 4.6 / Gemini 3.1 to generate 1,000 variations from 10 good seeds β†’ filter bad ones.

Real-World Examples in 2026

  • Banking/Fintech (e.g., Razorpay, Paytm internal tools) β†’ Need perfect JSON for legacy API calls. β†’ Few-shot prompting fails on edge cases (weird currencies, typos). β†’ Finetune Mistral-7B or Llama-4-8B with LoRA on 800 high-quality request β†’ JSON pairs. β†’ Result: 99.9% syntax perfect, no hallucinations on format β†’ deploy as tiny adapter.
  • Medical / Healthcare chatbots β†’ Finetune on PubMed abstracts + doctor Q&A pairs (LoRA on BioMistral or MedLlama). β†’ Combine with RAG for latest papers β†’ safer, more accurate clinical suggestions.

Quick start roadmap:
Use Hugging Face PEFT library + Unsloth (2–5x faster) + QLoRA on a rented A100 (RunPod ~β‚Ή200–300/hr). Start with 500–1k examples in Alpaca/ShareGPT format. Test on 100 holdout cases β€” aim for >95% on your format/style. Merge adapter β†’ serve with vLLM for speed.


Chapter 8: Dataset Engineering – The Real Heavy Lifting

If finetuning is the "coaching" part, dataset engineering is building the perfect training material first. In 2026, people realize: the dataset is harder and more important than the model itself. Bad data = bad model, no matter how fancy the architecture. Good data = even a smaller model crushes bigger ones.

This chapter is about curating, cleaning, filtering, deduplicating, and generating high-quality data for pre-training, SFT, or finetuning LLMs.

Why Dataset Engineering Matters So Much

  • Raw internet data (Common Crawl) is messy: duplicates, toxic stuff, PII (personal info), low-quality spam, boilerplate.
  • Training on junk makes models memorize garbage, hallucinate more, or overfit to repeats.
  • Quality > Quantity: Research (like LIMA, Phi series) shows 1,000–10,000 excellent examples beat 1 million noisy ones.
  • 2026 frontier datasets (Nemotron-CC, FineWeb-Edu, Dolma) are trillions of tokens but heavily curated β†’ that's why models keep improving.

Core Steps in the Data Curation Pipeline (Simple Breakdown)

  1. Collection / Ingestion β€” Grab raw data (Common Crawl dumps, internal docs, code repos, etc.).
  2. Preprocessing β€” Extract clean text, remove HTML/JS, detect language, normalize.
  3. Filtering β€” Throw out junk:
    • Heuristics (short docs, too many symbols, PII like emails/phone numbers).
    • Classifier-based (use small LLM like Mistral-7B or DeBERTa to score quality, toxicity, educational value).
  4. Deduplication β€” Remove exact/near-duplicates so model doesn't over-memorize phrases.
    • Tools: MinHash + LSH (Locality-Sensitive Hashing) β€” fast for trillions of docs.
  5. Blending / Bucketing β€” Mix sources (Wikipedia-style high quality + diverse web) in right ratios.
  6. Synthetic Generation (if needed) β€” Use strong LLM (Claude 4.6, Gemini 3.1) to create new examples.

(Image: building Nemotron-CC, a high-quality trillion-token pretraining dataset, with NVIDIA NeMo Curator. Source: https://developer.nvidia.com/blog/building-nemotron-cc-a-high-quality-trillion-token-dataset-for-llm-pretraining-from-common-crawl-using-nvidia-nemo-curator/)

Deduplication – The Silent Killer Fix

Duplicates waste compute and cause overfitting (model repeats exact sentences).

  • Exact dedup: Easy with hashes.
  • Near-duplicate (paraphrases, minor changes): Use MinHash + LSH β†’ cluster similar docs fast.
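
A minimal near-dedup sketch with the datasketch library (the similarity threshold and the character-shingle choice are illustrative assumptions):

Python

from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:   # 5-char shingles
        m.update(shingle.encode("utf8"))
    return m

docs = {"d1": "The cat sat on the mat.",
        "d2": "The cat sat on the mat!",          # near-duplicate of d1
        "d3": "LoRA adds low-rank adapters."}

lsh = MinHashLSH(threshold=0.8, num_perm=128)     # approximate Jaccard similarity cutoff
keep = []
for doc_id, text in docs.items():
    m = minhash(text)
    if lsh.query(m):            # something similar already kept -> drop as duplicate
        continue
    lsh.insert(doc_id, m)
    keep.append(doc_id)
print(keep)                     # e.g. ['d1', 'd3']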

Synthetic Data & Distillation – When Real Data Runs Out

  • Self-Instruct / synthetic data: Seed with 10–50 good examples → ask a strong LLM to generate 1,000+ new synthetic examples (instructions + responses); a sketch of this loop follows after this list.
    • Then filter: LLM judge for quality, check if code compiles, etc.
  • Distillation: Teacher (big model like GPT-5 level / Claude 4.6) generates outputs on prompts β†’ Student (smaller like Mistral-7B or Llama-4-8B) learns from them.
    • Cheaper, faster, transfers knowledge without huge raw data.
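
A sketch of the seed → generate → filter loop (the `call_llm` generator and the `passes_checks` filter are placeholders; in practice the filter might be an LLM judge, a compiler run, or unit tests):

Python

import json
import random

SEEDS = [
    {"instruction": "Extract invoice details", "input": "Invoice #123 ...", "output": "{...}"},
    # ...10-50 hand-written seed examples go here
]

def passes_checks(rec: dict) -> bool:
    # Placeholder filter: require the three keys and a non-trivial output.
    return {"instruction", "input", "output"} <= rec.keys() and len(str(rec["output"])) > 2

def generate_synthetic(call_llm, n: int = 1000, max_attempts: int = 5000) -> list[dict]:
    """call_llm: any str -> str wrapper around a strong generator model."""
    synthetic = []
    for _ in range(max_attempts):
        if len(synthetic) >= n:
            break
        examples = random.sample(SEEDS, k=min(3, len(SEEDS)))
        prompt = ("Here are example instruction/input/output records:\n"
                  + "\n".join(json.dumps(e) for e in examples)
                  + "\nWrite ONE new record in the same JSON format, on one line.")
        try:
            candidate = json.loads(call_llm(prompt))
        except json.JSONDecodeError:
            continue                        # malformed generation -> skip
        if passes_checks(candidate):
            synthetic.append(candidate)
    return synthetic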

(Image: how DeepSeek built cheaper, faster AI using knowledge distillation. Source: https://blog.doballi.com/how-deepseek-built-cheaper-faster-ai-using-knowledge-distillation/)

(Image: the distillation process – "Small Is the New Smart: Revolutionizing AI with Model Distillation", Afef Belhadj, Medium)

Example: Proprietary language "SuperCode" (your company's internal DSL) β†’ No public data exists. β†’ Write 10–20 manual examples. β†’ Feed to Claude 4.6 β†’ generate 2,000 variations. β†’ Filter: Run compiler on synthetic code β†’ keep only ones that compile + pass tests. β†’ Result: High-quality 1k–5k dataset β†’ finetune Mistral-7B β†’ specialized coder.

Quick start roadmap: Start small. Use tools like Hugging Face Datasets + Unsloth for loading/cleaning. Dedup with datasketch MinHashLSH. Generate synthetic with Grok/Claude API. Always run quality checks (perplexity, diversity metrics, manual spot-check). Quality dataset + QLoRA = killer combo.

Chapter 9: Inference Optimization – Serving at Scale

Inference is the part where your trained/finetuned model actually runs in production β€” answering real user queries, fast and cheap.

In 2026, the biggest bottlenecks aren't compute power anymore β€” it's memory bandwidth (moving huge model weights from VRAM to the GPU cores takes longer than doing the math) and KV cache management (storing past attention for long conversations).

The goal: Serve thousands of users at once, with low latency (<1–2 sec first token), low cost (β‚Ή0.1–1 per 1M tokens), and high throughput (tokens/sec).

Key Bottlenecks in 2026

  1. Memory Wall β€” Models like Llama 4 70B need ~140 GB in FP16 β†’ even quantized, moving 70–140 GB around every second is slow.
  2. KV Cache Explosion β€” For long chats (10k+ tokens context), KV cache grows linearly β†’ eats VRAM fast, limits concurrency.
  3. Prefill vs Decode β€” First token (prefill) reads whole prompt (slow), then decode (one token at a time) is autoregressive (can be fast but wastes compute if not batched).

Main Optimization Techniques:

1. Quantization β€” Shrink numbers from 16-bit β†’ 8-bit β†’ 4-bit (or even 2-bit in research)

  • FP16 β†’ INT8/AWQ/GPTQ β†’ 2Γ— smaller model, ~2Γ— faster, almost no quality drop.
  • QLoRA/4-bit β†’ 70B fits in 35–50 GB VRAM (single H200/A100).
  • 2026 sweet spot: AWQ or GPTQ-Int4 for most production.

(Image: optimizing LLMs with post-training quantization – Edge AI and Vision Alliance. Source: https://www.edge-ai-vision.com/2025/08/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/)


2. KV Cache Management – The Real Hero

  • Reuse past key/value vectors instead of recomputing attention every token.
  • PagedAttention (vLLM invention, now standard) β€” Treat KV cache like OS virtual memory: page it out to CPU if needed, swap back β†’ massive concurrency boost (5–20Γ— more users per GPU).

(Image: understanding KV cache and PagedAttention – Dewang Sultania. Source: https://medium.com/my-musings-with-llms/understanding-kv-cache-and-paged-attention-in-llms-a-deep-dive-into-efficient-inference-62fa372432ce)

3. Continuous / Dynamic Batching

  • Don't wait for fixed batch size.
  • Add new requests mid-generation β†’ process many users together β†’ read weights once, use on many prompts.
  • vLLM / TGI do this automatically β†’ 5–10Γ— throughput jump.
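
A minimal vLLM serving sketch illustrating the ideas above (the checkpoint name and sampling settings are assumptions; continuous batching and PagedAttention are handled internally by vLLM):

Python

from vllm import LLM, SamplingParams

# Load a quantized model once; vLLM manages the KV cache with PagedAttention
# and batches incoming requests continuously.
llm = LLM(model="TheBloke/Llama-2-13B-AWQ", quantization="awq")   # assumed AWQ checkpoint

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [
    "Summarize our refund policy in two sentences.",
    "Explain PagedAttention to a new engineer.",
]
for output in llm.generate(prompts, params):      # batched in one call
    print(output.outputs[0].text)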

(Image: batching in LitServe – Lightning AI docs. Source: https://lightning.ai/docs/litserve/features/batching)

4. Speculative Decoding

  • Small "draft" model guesses 4–8 tokens ahead super fast.
  • Big model verifies in parallel (one forward pass checks all).
  • If correct β†’ free speedup (2–3Γ— tokens/sec).
  • 2026 common: Medusa, Lookahead, or Eagle spec decoders on top of Llama/Mistral.

(Image: the speculative decoding process. Source: https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)

Example: Indian startups / enterprises (Zomato, Swiggy, Paytm-like scale) β†’ Serve Llama-4-70B or Mistral-3 with vLLM + 4-bit AWQ on RunPod/Nebius A100s. β†’ Before: 2–3 concurrent users per GPU. β†’ After PagedAttention + continuous batching: 20–50+ concurrent. β†’ Cost drop: ~4–10Γ— cheaper per query.

Chapter 10: Architecture & User Feedback – The Long Game

In 2026, no serious AI system is static. The winners build data flywheels β€” user interactions create better data β†’ better data improves the model/RAG/prompts β†’ better outputs β†’ more users β†’ more data β†’ loop forever. This is how DoorDash, LinkedIn, Notion, and Indian unicorns (Zomato, Groww, Cred) keep their AI ahead.

Core Concepts –

A) The Data Flywheel

Every time a user interacts, you collect signals:

(Image: the data flywheel – why AI products live or die by user feedback. Source: https://mrmaheshrajput.medium.com/the-data-flywheel-why-ai-products-live-or-die-by-user-feedback-4ae7aab32d4d)

a. Implicit feedback β€” Did they copy the answer? Edit it? Regenerate? Abandon the chat? Stay longer?

b. Explicit feedback β€” Thumbs up/down, rating 1–5, "this was helpful" button. β†’ Save good/bad examples β†’ build "golden dataset" β†’ use for:

      • Evaluating new model versions
      • Finetuning adapters
      • Improving RAG retrieval (better chunking, reranking)
      • Refining prompts


B) Gateway / Router Pattern

Never connect your app directly to one LLM API. Put a smart middle layer (gateway/router) in between: query → classifier → route to different models/providers → combine response, trading off cost and latency (simple queries to fast/cheap models, complex ones to reasoning-heavy models). The gateway:

    • Classifies query difficulty (simple vs complex reasoning)
    • Routes: easy β†’ cheap/fast model (DeepSeek V3, Mistral 3, Llama-4-8B), hard β†’ expensive/smart (Claude 4.6 Opus, Gemini 3.1 Pro)
    • Fallback: If one provider is down/slow/expensive β†’ switch automatically
    • Rate-limit per user/company β†’ prevent budget blow-up
    • Log everything for later analysis
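
A toy router sketch for this pattern (the model names, the difficulty heuristic, and the `call_model` and `log` helpers are all illustrative assumptions):

Python

import time

CHEAP_MODEL = "llama-4-8b"        # assumed fast/cheap tier
SMART_MODEL = "claude-4.6-opus"   # assumed expensive/reasoning tier

def classify_difficulty(query: str) -> str:
    # Placeholder heuristic; production routers often use a small classifier model here.
    hard_markers = ("why", "prove", "step by step", "analyze", "debug")
    is_hard = len(query) > 300 or any(m in query.lower() for m in hard_markers)
    return "complex" if is_hard else "simple"

def route(query: str, call_model, user_id: str, log) -> str:
    """call_model(model_name, query) -> str; log(record) persists the trace for analysis."""
    model = CHEAP_MODEL if classify_difficulty(query) == "simple" else SMART_MODEL
    start = time.time()
    try:
        answer = call_model(model, query)
    except Exception:                          # provider down/slow -> fall back automatically
        model, answer = CHEAP_MODEL, call_model(CHEAP_MODEL, query)
    log({"user": user_id, "model": model, "latency_s": time.time() - start,
         "query": query, "answer": answer})
    return answer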


πŸ’¬ Join the DecodeAI WhatsApp Channel
Get AI guides, bite-sized tips & weekly updates delivered where it’s easiest – WhatsApp.
πŸ‘‰ Join Now