Generative AI for Business — Week 1

Foundations of Generative AI

Transformers, Attention, and Diffusion Models

Week 1

JHU Carey Business School | 2026

Today's agenda

Time         Topic
0:00–0:30    What is Generative AI?
0:30–1:10    Transformer architecture
1:10–1:35    Diffusion models
1:35–1:50    Break
1:50–2:10    Hands-on: Setup + Tokenization
2:10–2:40    Hands-on: Attention visualization
2:40–3:00    Wrap-up + next week preview

What is Generative AI?

Discriminative models: Given input X, predict label Y

  • Spam classifier, sentiment analysis, fraud detection

Generative models: Learn the distribution of data, generate new samples

  • Text, images, audio, video, code

The shift: from "classify this" to "create this"


The timeline

2014         2017          2020         2022         2023–now
 │            │             │            │             │
GANs     "Attention       GPT-3       ChatGPT     GPT-4, Claude,
          Is All You     (scaling)    (RLHF +     Gemini, open-
          Need"                       product)    source boom

Key inflection points:

  • 2017: Transformer architecture (Vaswani et al.)
  • 2020: Scaling laws demonstrated (Kaplan et al.)
  • 2022: RLHF makes models usable by non-experts
  • 2024–25: Reasoning models, agents, multimodal

Why transformers won

Previous approaches (RNNs, LSTMs):

  • Process tokens sequentially → slow to train
  • Struggle with long-range dependencies
  • Hard to parallelize across GPUs

Transformers:

  • Process all tokens in parallel via attention
  • Scale efficiently with more compute and data
  • Single architecture works across text, image, audio, code

Tokenization

Text → Tokens → Numbers → Model → Numbers → Text

"Hello, world!" → ["Hello", ",", " world", "!"] → [9906, 11, 1917, 0]

Why it matters:

  • Determines what the model "sees"
  • Affects cost (you pay per token)
  • Different languages tokenize differently (efficiency varies 2-10x)
  • Context window = max tokens, not max words
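The pipeline above can be sketched with a toy tokenizer. The vocabulary below is hand-built to reproduce the slide's example; real models use learned subword (BPE) vocabularies with tens of thousands of entries.

```python
# Toy illustration of the text -> tokens -> IDs pipeline.
# This hand-built vocab only mimics the idea; real tokenizers
# learn their subword vocabulary (BPE) from data.
toy_vocab = {"Hello": 9906, ",": 11, " world": 1917, "!": 0}

def toy_tokenize(text):
    """Greedily match the longest known token at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for piece in sorted(toy_vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"no token at position {i}")
    return tokens

tokens = toy_tokenize("Hello, world!")
ids = [toy_vocab[t] for t in tokens]
print(tokens)  # ['Hello', ',', ' world', '!']
print(ids)     # [9906, 11, 1917, 0]
```

Note that " world" carries its leading space: real tokenizers treat "word" and " word" as different tokens, which is one reason token counts surprise people.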

Embeddings

Tokens → dense vectors in high-dimensional space

  • Similar meanings → nearby vectors
  • Captures semantic relationships
  • king − man + woman ≈ queen
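The vector arithmetic can be sketched with hand-made 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions and are learned from data, not written by hand.

```python
import math

# Hand-made 3-d vectors for illustration only.
# Dimensions (roughly): [royalty, male-ness, female-ness]
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, element-wise
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

# The nearest word by cosine similarity:
best = max(vecs, key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```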

Self-attention: the intuition

Question: When reading "The cat sat on the mat because it was tired"
— what does "it" refer to?

Self-attention lets each token look at every other token and decide which ones matter.

For "it":

  • High attention to "cat" ✓
  • Low attention to "mat" ✗
  • This is learned, not programmed

Self-attention: the mechanics

Three vectors per token:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide?"

  1. Dot product of each Query with every Key (scaled by √dₖ) → relevance scores
  2. Softmax → normalize scores into attention weights that sum to 1
  3. Weighted sum of the Values → context-aware representation
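The three steps can be run in plain Python for three tokens with made-up 2-d Q/K/V vectors. Real models compute Q, K, and V from each token via learned weight matrices and use far larger dimensions; this sketch only shows the mechanics.

```python
import math

# Made-up Q/K/V vectors for 3 tokens (one row per token).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
d_k = 2  # key dimension, used to scale the scores

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    out = []
    for q in Q:
        # 1. dot product of this query with every key -> relevance scores
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        # 2. softmax -> attention weights that sum to 1
        weights = softmax(scores)
        # 3. weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

for row in attention(Q, K, V):
    print([round(x, 3) for x in row])
```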

Multi-head attention

One attention head = one "perspective"

Multiple heads in parallel → the model attends to different things simultaneously:

  • Head 1: syntactic relationships (subject-verb)
  • Head 2: semantic relationships (pronoun-referent)
  • Head 3: positional patterns (nearby words)

Results are concatenated and projected → richer representation
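The split / per-head transform / concatenate structure can be sketched as follows. The per-head "attention" is stubbed out as a simple linear map so the data flow stays visible; real heads each run full Q/K/V attention with their own learned weights.

```python
# Toy sketch of multi-head structure: a 4-d token vector is split
# into two 2-d heads, each head applies its own transform, and the
# results are concatenated back to 4-d.
token = [0.1, 0.2, 0.3, 0.4]   # made-up 4-d representation
n_heads, head_dim = 2, 2

def head_transform(x, weights):
    # stand-in for one head's attention output
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

W = [  # one made-up 2x2 weight matrix per head
    [[1.0, 0.0], [0.0, 1.0]],   # head 1: identity
    [[0.0, 1.0], [1.0, 0.0]],   # head 2: swaps its two dims
]

heads = [token[i * head_dim:(i + 1) * head_dim] for i in range(n_heads)]
outputs = [head_transform(h, W[i]) for i, h in enumerate(heads)]
merged = [x for out in outputs for x in out]   # concatenate -> 4-d again
print(merged)  # [0.1, 0.2, 0.4, 0.3]
```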


The full transformer block

Input Embeddings + Positional Encoding
                │
        ┌───────▼───────┐
        │  Multi-Head   │
        │   Attention   │ ──── + residual connection
        └───────┬───────┘
                │
           Layer Norm
                │
        ┌───────▼───────┐
        │     Feed      │
        │    Forward    │ ──── + residual connection
        └───────┬───────┘
                │
           Layer Norm
                │
             Output

Stack N of these → a transformer
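The wiring in the diagram can be sketched with stubbed sub-layers: the attention and feed-forward stubs below are identity functions, so only the residual connections and layer norms from the diagram are actually exercised.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to mean 0, variance ~1."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def multi_head_attention(x):
    return x  # stub: the real sub-layer mixes information across tokens

def feed_forward(x):
    return x  # stub: the real sub-layer is two linear layers + nonlinearity

def transformer_block(x):
    # attention -> add residual -> layer norm
    x = layer_norm([a + b for a, b in zip(x, multi_head_attention(x))])
    # feed-forward -> add residual -> layer norm
    x = layer_norm([a + b for a, b in zip(x, feed_forward(x))])
    return x

hidden = transformer_block([0.5, -1.0, 2.0, 0.1])
print([round(h, 3) for h in hidden])
```

Stacking N calls to `transformer_block` (each with its own weights) gives the "stack N of these" structure.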


Encoder vs. decoder

             Encoder                                    Decoder
Attention    Bidirectional (sees all tokens)            Causal (sees only past tokens)
Use case     Understanding (classification, embedding)  Generation (text, code)
Example      BERT                                       GPT, Claude

Encoder-decoder (original transformer): translation, summarization
Decoder-only (modern LLMs): general-purpose generation
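The "sees only past tokens" rule is enforced by a causal mask; a minimal sketch for a 4-token sequence:

```python
# Causal mask for a decoder: position i may attend to positions
# 0..i (1 = allowed), never to positions ahead of it (0 = blocked).
n = 4  # sequence length
mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
# An encoder's bidirectional mask would be all 1s instead.
```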


Pre-training → Fine-tuning → RLHF

Pre-training (unsupervised)

  • Predict next token on massive text corpora
  • Learns language, facts, reasoning patterns
  • Expensive: millions of dollars, weeks of GPU time

Fine-tuning (supervised)

  • Train on curated (prompt, response) pairs
  • Adapts to specific tasks or styles

RLHF (reinforcement learning from human feedback)

  • Humans rank model outputs → train a reward model
  • Model optimizes for human preferences
  • This is what makes ChatGPT feel "helpful"
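The pre-training objective, shrunk to a toy: predict the next token from counts. A bigram count model stands in for the neural network here; real pre-training fits billions of parameters to trillions of tokens, but the objective is the same.

```python
from collections import Counter, defaultdict

# A tiny "corpus" of whitespace tokens.
corpus = "the cat sat on the mat the cat ran".split()

# Count which token follows which.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent continuation seen in training."""
    return counts[token].most_common(1)[0][0]

print(predict_next("the"))  # cat  ("cat" follows "the" twice, "mat" once)
```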

Scaling laws

More data + more parameters + more compute = predictably better performance

Business implications:

  • Larger models are expensive but more capable
  • The curve hasn't plateaued yet
  • Cost per token is dropping fast
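The "predictably better" part is literal: loss follows a power law in model size. A sketch of the curve shape, with made-up constants (not the fitted values from Kaplan et al.):

```python
# Illustrative scaling-law curve: loss falls as a power law in the
# parameter count N. The constants are invented for illustration.
N_c, alpha = 8.8e13, 0.076  # hypothetical scale constant and exponent

def loss(n_params):
    return (N_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> loss {loss(n):.3f}")
```

Each 10× increase in parameters buys a predictable, diminishing drop in loss, which is why labs could budget training runs in advance.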

Diffusion models

A completely different approach to generation (primarily images).

Forward process: Gradually add noise to a real image until it's pure noise

Reverse process: Train a neural network to denoise — step by step, recover the image

Real image → 🔊🔊🔊🔊🔊🔊 → Pure noise
Pure noise → 🧹🧹🧹🧹🧹🧹 → Generated image

Key insight: learning to remove a little noise at each step is easier than learning to generate an image in one shot
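The forward (noising) process can be sketched on a 1-d "image": each step mixes in a little Gaussian noise until the signal is unrecoverable. Real diffusion models use a variance schedule over 2-d images; this is a bare-bones illustration with a constant noise level.

```python
import random

random.seed(0)

image = [0.9, 0.8, 0.1, 0.0]   # made-up pixel values
beta = 0.3                     # fraction of noise mixed in per step

x = image
for step in range(10):
    # one forward step: shrink the signal, add fresh Gaussian noise
    x = [((1 - beta) ** 0.5) * xi + (beta ** 0.5) * random.gauss(0, 1)
         for xi in x]

print([round(v, 2) for v in x])  # after 10 steps: close to pure noise
```

The reverse process trains a network to undo one of these steps at a time, starting from pure noise.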


Text-to-image pipeline

"A cat wearing a top hat, oil painting style"
           │
     Text Encoder (CLIP)
           │
     Maps text → latent space
           │
     Diffusion Model (U-Net)
           │
     Iterative denoising in latent space
           │
     VAE Decoder
           │
     Final image

This is how Stable Diffusion, DALL-E 3, and Midjourney work (with variations).


The landscape today

Modality      Key models
Text          GPT-4o, Claude 4, Gemini 2, Llama 3, DeepSeek
Image         DALL-E 3, Midjourney, Stable Diffusion 3, Flux
Video         Sora, Runway Gen-3, Kling
Audio         Whisper, ElevenLabs, Suno
Code          Claude, GPT-4o, Codex, Cursor
Multimodal    GPT-4o, Gemini, Claude (vision + text)

Break

15 minutes


Hands-on

Setup + Exercises


Setup check

By now you should have:

  • [ ] Python 3.10+ installed
  • [ ] Claude Code CLI installed (claude command works)
  • [ ] Anthropic API key set (ANTHROPIC_API_KEY)
  • [ ] Course repo cloned

Quick test:

claude "What is 2 + 2?"

If that works, you're ready.


Exercise 1: Tokenization Explorer

Open your terminal in the course repo:

cd scripts/week1
claude "Read tokenizer_explore.py and explain what it does"

Then try:

  • Run the script with different sentences
  • Ask Claude Code to add a comparison between English and another language
  • Ask Claude Code to add a cost estimator (at $3/M input tokens)

Discussion: What surprised you about tokenization?


Exercise 2: Attention Visualization

claude "Read attention_viz.py, explain it, then run it"

The script loads a small transformer and visualizes attention patterns.

MS track:

  • Modify the script to compare attention across different layers
  • What changes between early and late layers?

MBA track:

  • Use Claude Code to interpret the visualization
  • Write a 3-sentence explanation of what the model is "paying attention to"

Next week preview

Week 2: Generative AI in Action (I)

  • Foundation model landscape (open vs. closed, large vs. small)
  • The GenAI ecosystem: infrastructure, models, applications
  • Choosing the right model for your use case
  • Hands-on: model comparison + build a simple GenAI tool

Reading:

  • Bommasani et al., "On the Opportunities and Risks of Foundation Models" — Sections 1-2
  • Browse model cards on Hugging Face

Questions?

