One attention head = one "perspective"
Multiple heads in parallel → the model attends to different things simultaneously:
Results are concatenated and projected → richer representation
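The concatenate-and-project step can be sketched in a few lines of NumPy. This is a toy illustration (random weights, no masking or batching), not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Toy multi-head self-attention. X: (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split into heads: (n_heads, seq_len, d_head) -- each head is one "perspective"
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # heads attend independently
    heads = softmax(scores) @ V                          # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate
    return concat @ Wo                                   # final output projection

rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 8, 4, 2
X = rng.normal(size=(seq_len, d_model))
Ws = [rng.normal(size=(d_model, d_model)) for _ in range(4)]  # Wq, Wk, Wv, Wo
out = multi_head_attention(X, *Ws, n_heads)
print(out.shape)  # (4, 8)
```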
Input Embeddings + Positional Encoding
                │
        ┌───────▼───────┐
        │   Multi-Head  │
        │   Attention   │ ──── + residual connection
        └───────┬───────┘
                │
           Layer Norm
                │
        ┌───────▼───────┐
        │     Feed      │
        │    Forward    │ ──── + residual connection
        └───────┬───────┘
                │
           Layer Norm
                │
             Output
Stack N of these → a transformer
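The block above maps to a few lines of NumPy. This sketch uses the post-norm layout shown in the diagram (norm after the residual add); the attention function is a stand-in placeholder, not a real attention layer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def transformer_block(x, attn, ffn):
    x = layer_norm(x + attn(x))  # multi-head attention + residual, then norm
    x = layer_norm(x + ffn(x))   # feed-forward + residual, then norm
    return x

rng = np.random.default_rng(1)
d = 8
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
ffn = lambda x: np.maximum(x @ W1, 0) @ W2  # two-layer MLP with ReLU
attn = lambda x: x                          # placeholder for multi-head attention
x = rng.normal(size=(5, d))
for _ in range(3):                          # "stack N of these"
    x = transformer_block(x, attn, ffn)
print(x.shape)  # (5, 8)
```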
| | Encoder | Decoder |
|---|---|---|
| Attention | Bidirectional (sees all tokens) | Causal (sees only past tokens) |
| Use case | Understanding (classification, embedding) | Generation (text, code) |
| Example | BERT | GPT, Claude |
Encoder-decoder (original transformer): translation, summarization
Decoder-only (modern LLMs): general-purpose generation
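Mechanically, the difference between the two attention patterns is just a mask on the attention-score matrix. A minimal NumPy illustration:

```python
import numpy as np

seq_len = 4
# Encoder (bidirectional): every token may attend to every other token.
bidirectional = np.ones((seq_len, seq_len), dtype=bool)
# Decoder (causal): token i may attend only to tokens j <= i (lower triangle).
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
# In attention, masked-out scores are set to -inf before the softmax,
# so those positions receive zero attention weight.
```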
1. Pre-training (unsupervised): predict the next token over a vast text corpus
2. Fine-tuning (supervised): learn from curated input/output examples
3. RLHF (reinforcement learning from human feedback): align outputs with human preferences
More data + more parameters + more compute = predictably better performance
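"Predictably better" can be made concrete with a simple power law. The sketch below uses roughly the parameter-count constants reported by Kaplan et al. (2020); treat the numbers as illustrative, not predictive:

```python
# Illustrative power law: loss falls smoothly as parameter count grows.
# Constants are approximately those fitted by Kaplan et al. (2020).
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> loss ~{power_law_loss(n):.2f}")
```

The business-relevant point is the smoothness: each 10x in parameters buys a predictable, diminishing drop in loss, which makes training budgets plannable in advance.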
Business implications:
Diffusion models take a completely different approach to generation (primarily images).
Forward process: Gradually add noise to a real image until it's pure noise
Reverse process: Train a neural network to denoise — step by step, recover the image
Forward:  Real image → noisier → noisier → … → Pure noise
Reverse:  Pure noise → denoise → denoise → … → Generated image
Key insight: denoising is easier to learn than generating from scratch
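The forward (noising) process above has a convenient closed form: you can jump straight to any noise level without simulating every step. A minimal sketch of the standard DDPM formulation (the 8x8 "image" is a random stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, t, betas):
    """Jump to step t of the forward process (DDPM closed form):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

betas = np.linspace(1e-4, 0.02, 1000)   # standard DDPM noise schedule
image = rng.normal(size=(8, 8))         # stand-in for a real image
slightly_noisy = forward_noise(image, 10, betas)
almost_pure_noise = forward_noise(image, 999, betas)
```

The reverse process is the learned part: a network is trained to predict the added noise at each step, which is the "easier to learn" piece the key insight refers to.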
"A cat wearing a top hat, oil painting style"
│
Text Encoder (CLIP)
│
Maps the prompt → conditioning embeddings
│
Diffusion Model (U-Net)
│
Iterative denoising in latent space
│
VAE Decoder
│
Final image
This is how Stable Diffusion, DALL-E 3, and Midjourney work (with variations).
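The pipeline's shape can be seen in a toy end-to-end sketch. All three components below are made-up stubs (random math, invented shapes) that only mirror the data flow, not the real models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three real components; names and shapes are invented.
def text_encoder(prompt):            # CLIP role: prompt -> conditioning embeddings
    return rng.normal(size=(77, 16))

def denoise_step(latent, cond, t):   # U-Net role: predict and remove a bit of noise
    return latent - 0.1 * rng.normal(size=latent.shape)

def vae_decoder(latent):             # VAE role: small latent -> full-size pixels
    return np.tanh(latent.repeat(8, axis=0).repeat(8, axis=1))

cond = text_encoder("A cat wearing a top hat, oil painting style")
latent = rng.normal(size=(8, 8))     # start from pure noise in latent space
for t in reversed(range(50)):        # iterative denoising
    latent = denoise_step(latent, cond, t)
image = vae_decoder(latent)
print(image.shape)  # (64, 64)
```

The design point worth noticing: denoising happens in the small latent space (8x8 here) rather than pixel space, which is what makes latent diffusion fast enough to be practical.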
| Modality | Key models |
|---|---|
| Text | GPT-4o, Claude 4, Gemini 2, Llama 3, DeepSeek |
| Image | DALL-E 3, Midjourney, Stable Diffusion 3, Flux |
| Video | Sora, Runway Gen-3, Kling |
| Audio | Whisper, ElevenLabs, Suno |
| Code | Claude, GPT-4o, Codex, Cursor |
| Multimodal | GPT-4o, Gemini, Claude (vision + text) |
By now you should have:
- Claude Code installed (the claude command works)
- An API key configured (ANTHROPIC_API_KEY)

Quick test:
claude "What is 2 + 2?"
If that works, you're ready.
Open your terminal in the course repo:
cd scripts/week1
claude "Read tokenizer_explore.py and explain what it does"
Then try:
Discussion: What surprised you about tokenization?
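For a feel of why words split into surprising pieces, here is a toy greedy longest-match subword tokenizer. The vocabulary is invented for illustration; real tokenizers (BPE) learn their merges from data:

```python
# Toy vocabulary: a few subwords plus single-character fallbacks.
vocab = {"un", "break", "able", "b", "r", "e", "a", "k", "u", "n", "l"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])         # unknown character fallback
            i += 1
    return tokens

print(tokenize("unbreakable"))  # ['un', 'break', 'able']
```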
claude "Read attention_viz.py, explain it, then run it"
The script loads a small transformer and visualizes attention patterns.
MS track:
MBA track:
Week 2: Generative AI in Action (I)
Reading:
This is a placeholder — replace with your own embedding diagram
MS: walk through the math. MBA: focus on the intuition above.
Placeholder — replace with scaling law chart