Welcome back. This week we zoom out from architecture to the landscape: who's building what, how the pieces fit together, and how you make model selection decisions for real business problems.
The term "foundation model" was coined by the Stanford HAI group in 2021. The key insight: instead of training a separate model for each task, you train one massive model on broad data and adapt it. This is a paradigm shift from traditional ML where you'd have a different model for sentiment analysis, translation, classification, etc.
The landscape is bifurcated. Closed-source models from OpenAI, Anthropic, and Google are typically the most capable, but you're dependent on their APIs. Open-weight models from Meta, Mistral, and DeepSeek give you more control: you can host them yourself, fine-tune them, inspect them. The capability gap is narrowing. For many business tasks, open models are now competitive.
Most companies start with closed APIs because they're the fastest way to get started. As they scale and develop specific needs (data residency, fine-tuning, cost optimization), they often adopt open models for some workloads. The mature approach is a portfolio: use the best model for each task. Some tasks need frontier capability; others just need a fast, cheap model.
This is a critical business insight. Many companies default to the biggest model because they want "the best." But for high-volume tasks like classification, extraction, or routing, a small model at 1/50th the cost can be just as good. The right approach: start with the biggest model to establish a quality baseline, then try to match that quality with a smaller model. Use frontier models for the hard stuff.
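Here's what that workflow looks like in code: run your task prompts through a frontier model to set the bar, then through a small model and compare. A minimal sketch using the Anthropic Python SDK; the model names and prompts are illustrative, so substitute whatever is current when you run it.

```python
# Sketch: establish a quality baseline with a large model, then check whether
# a small model matches it. Model names are illustrative placeholders.
# Assumes the Anthropic Python SDK with an API key in the environment.
import anthropic

client = anthropic.Anthropic()

TASK_PROMPTS = [
    "Classify the sentiment of: 'The delivery was late but support fixed it fast.'",
    "Extract the company names from: 'Acme Corp acquired Globex in 2024.'",
]

def run(model: str, prompt: str) -> str:
    resp = client.messages.create(
        model=model,
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

for prompt in TASK_PROMPTS:
    baseline = run("claude-opus-4-20250514", prompt)    # frontier: sets the bar
    candidate = run("claude-3-5-haiku-latest", prompt)  # small: can it match?
    print(f"PROMPT:    {prompt}\nBASELINE:  {baseline}\nCANDIDATE: {candidate}\n")
```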
The trend is clear: models are becoming multimodal. GPT-4o can see, hear, and speak. Claude can read images and PDFs. Gemini processes video. This matters for business because it means a single API can handle text, image, and audio tasks. The modality walls are coming down. Think about what this means for document processing, customer service, content creation.
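To make the single-API point concrete, here's a minimal sketch of sending an image and a text question in one request via the Anthropic Messages API. The model name and the invoice file are illustrative; the structure (mixed image and text content blocks in one message) is the point.

```python
# Sketch: one API call that mixes modalities. File and model are placeholders.
import base64
import pathlib

import anthropic

client = anthropic.Anthropic()
image_b64 = base64.standard_b64encode(
    pathlib.Path("invoice.png").read_bytes()  # hypothetical local file
).decode()

resp = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text",
             "text": "Extract the vendor, date, and total from this invoice."},
        ],
    }],
)
print(resp.content[0].text)
```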
This is the GenAI value chain from the Wu & Higgins reading. Infrastructure is where the money was made first (look at NVIDIA's stock price). Models sit on top, then orchestration tools, then applications. The key insight for business strategy: value capture is shifting upward. Infrastructure is commoditizing, model capabilities are converging, and the real differentiation is in the application layer, in how you use these models to solve specific problems. This is where most of you will operate.
Training and inference are fundamentally different businesses. Training requires massive upfront capital — hundreds of millions for frontier models. Only a handful of organizations do this. Inference is the ongoing cost of using models. This is what you pay as a business when you make API calls. The training vs inference distinction matters for understanding the economics: training costs are fixed and declining per unit of capability, inference costs are variable and directly tied to usage.
Let's make this concrete. A typical business query might cost 1 to 2 cents with a frontier model. That sounds trivial, but it compounds fast. At enterprise scale, say 100K queries per day at the 2-cent end, you're looking at $730K per year just for model API costs. This is why model selection matters. If a mid-tier model can do the job at 1/5th the cost, that's nearly $600K in annual savings. Always prototype with the best model, then optimize.
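The arithmetic, spelled out (the per-query price is the slide's estimate, not a quoted rate):

```python
# Back-of-envelope inference economics, as a sanity check on the slide.
COST_PER_QUERY = 0.02     # 2 cents: high end of the frontier-model estimate
QUERIES_PER_DAY = 100_000

annual_frontier = COST_PER_QUERY * QUERIES_PER_DAY * 365
annual_midtier = annual_frontier / 5  # mid-tier model at 1/5th the cost

print(f"Frontier model: ${annual_frontier:,.0f}/yr")                  # $730,000/yr
print(f"Mid-tier model: ${annual_midtier:,.0f}/yr")                   # $146,000/yr
print(f"Savings:        ${annual_frontier - annual_midtier:,.0f}/yr") # $584,000/yr
```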
Three ways to use the model layer. Most businesses start with pre-trained models via API, customizing behavior through prompting. When prompting hits its limits — domain-specific terminology, particular output formats, specialized reasoning — you fine-tune. Distillation is the next optimization: use a big model to generate training data, then train a small model to mimic it. This gives you frontier-quality outputs at small-model costs.
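Here's a sketch of the data-generation half of distillation: a large "teacher" model labels raw inputs, and the labeled pairs become training data for a small "student" model. The teacher model name and the JSONL format are illustrative; the actual fine-tuning step depends on your provider or training stack.

```python
# Sketch: distillation, step one. A big model generates labels; the pairs
# become fine-tuning data for a small model. Teacher model is a placeholder.
import json

import anthropic

client = anthropic.Anthropic()

raw_inputs = [
    "Ticket: my card was charged twice",
    "Ticket: how do I reset my password?",
]

with open("distill_train.jsonl", "w") as f:
    for text in raw_inputs:
        teacher = client.messages.create(
            model="claude-opus-4-20250514",  # illustrative teacher model
            max_tokens=20,
            messages=[{"role": "user",
                       "content": f"Label this support ticket as BILLING, ACCOUNT, or OTHER. "
                                  f"Reply with the label only.\n{text}"}],
        )
        label = teacher.content[0].text.strip()
        # One prompt/completion pair per line, ready for a fine-tuning job
        f.write(json.dumps({"prompt": text, "completion": label}) + "\n")
```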
Three dominant application patterns, arranged by autonomy level. Chatbots are reactive — user asks, AI answers. Copilots are collaborative — user works, AI assists alongside. Agents are proactive — they act autonomously with minimal human input. Most enterprise deployments today are chatbots and copilots. Agents are the frontier — higher value but harder to get right. We'll spend Weeks 4-5 on agents.
LLMOps is the operational layer that most people forget about. You need prompt versioning (your prompts are code), a gateway to handle rate limits and caching, evaluation and monitoring of outputs, cost dashboards, and feedback loops. This is analogous to MLOps but with new challenges: non-deterministic outputs, no clear accuracy metric, fast-changing model landscape. Companies that skip this step end up with fragile, expensive, unmonitored AI deployments.
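To show what "gateway" means in practice, here's a minimal sketch: one choke point that every model call goes through, adding caching and cost logging. Real gateways also handle rate limits, retries, and routing; the price constant below is a placeholder, not a real rate.

```python
# Sketch of an LLM gateway: wrap every model call in one function that adds
# caching and cost logging. The per-token price is a placeholder.
import hashlib
import json
import time

import anthropic

client = anthropic.Anthropic()
_cache: dict[str, str] = {}
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # placeholder rate; check your provider

def gateway_call(model: str, prompt: str) -> str:
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key in _cache:  # cache hit: zero cost, zero latency
        return _cache[key]
    start = time.time()
    resp = client.messages.create(model=model, max_tokens=500,
                                  messages=[{"role": "user", "content": prompt}])
    text = resp.content[0].text
    # Log the fields a cost dashboard would aggregate
    print(json.dumps({
        "model": model,
        "latency_s": round(time.time() - start, 2),
        "output_tokens": resp.usage.output_tokens,
        "est_cost": resp.usage.output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS,
    }))
    _cache[key] = text
    return text
```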
Here's a practical 2x2 for model selection. On one axis: how complex is the task? Simple extraction vs. complex reasoning. On the other: how sensitive is the data? Public info vs. PII or trade secrets. Low complexity + low sensitivity = small cloud model, cheapest option. High complexity + high sensitivity = either frontier model with enterprise agreements, or fine-tune an open model you host yourself. Most business tasks fall in the middle, which is why mid-tier models like Sonnet are so popular.
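The 2x2 is simple enough to write as a routing function. The tiers are illustrative; the point is that complexity and sensitivity are separate axes, and each quadrant gets a different answer.

```python
# The model-selection 2x2 as a routing function. Tier names are illustrative.
def select_model(complexity: str, sensitivity: str) -> str:
    if complexity == "low" and sensitivity == "low":
        return "small cloud model (cheapest)"
    if complexity == "high" and sensitivity == "high":
        return "frontier model w/ enterprise agreement, or self-hosted fine-tuned open model"
    if sensitivity == "high":
        return "self-hosted open model"
    return "mid-tier cloud model"  # the common middle of the 2x2

print(select_model("low", "high"))  # simple extraction task, but PII involved
```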
The build-buy spectrum for GenAI. Buying a SaaS product (like Jasper for marketing copy) is fastest but least differentiated. Building your own fine-tuned model is most powerful but takes months. The sweet spot for most companies is the middle: use foundation model APIs with custom prompts and orchestration. This is what we'll teach you to do in this course. You can build sophisticated AI applications without training a single model.
Let's work through this together. Email auto-reply: high volume, moderate accuracy requirements. Use a mid-tier or small model, maybe fine-tuned on your email style; cost matters here. Legal contract review: low volume but very high stakes. Use a frontier model, possibly with a human in the loop, and an enterprise API agreement for data protection. Meeting summaries: moderate volume, low data sensitivity. Almost anything works; probably a mid-tier model via API, maybe even a small model. The point: one company, three different model strategies.
Vendor lock-in is real in GenAI. Your prompts get tuned to a specific model's behavior. Your evaluation benchmarks are calibrated to its outputs. Your users get used to its personality. Switching costs are higher than they appear. Mitigation: build an abstraction layer so you can swap models. Test against multiple models periodically. The MCP standard we'll cover in Week 5 is specifically designed to address tooling lock-in.
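Here's what the abstraction layer looks like in its simplest form: application code calls one function instead of a vendor SDK, so swapping providers means editing one adapter. This sketch uses the Anthropic and OpenAI Python clients; model names are placeholders.

```python
# Sketch of a thin abstraction layer over two providers. Application code
# calls complete(), never a vendor SDK directly. Model names are placeholders.
import anthropic
import openai

def complete(provider: str, model: str, prompt: str) -> str:
    if provider == "anthropic":
        resp = anthropic.Anthropic().messages.create(
            model=model, max_tokens=500,
            messages=[{"role": "user", "content": prompt}])
        return resp.content[0].text
    if provider == "openai":
        resp = openai.OpenAI().chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content
    raise ValueError(f"unknown provider: {provider}")

# Swapping vendors is now a one-line change at the call site:
answer = complete("anthropic", "claude-sonnet-4-20250514", "Summarize our refund policy.")
```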
This exercise makes the model selection discussion concrete. You'll send the same prompt to multiple models and compare the outputs on quality, cost, and latency. Use Claude Code to modify the script — add models, change the task, add evaluation criteria. The goal is to develop intuition for how models differ in practice, not just on benchmarks.
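A stripped-down version of what the exercise script does, assuming the Anthropic SDK: same prompt, several models, with latency and token usage printed side by side. The model list is illustrative; extending it is exactly the kind of change to make with Claude Code.

```python
# Sketch of the comparison loop: one prompt, several models, timed.
import time

import anthropic

client = anthropic.Anthropic()
MODELS = ["claude-opus-4-20250514", "claude-sonnet-4-20250514",
          "claude-3-5-haiku-latest"]  # illustrative list; add your own
PROMPT = "In two sentences, explain vendor lock-in to a new product manager."

for model in MODELS:
    start = time.time()
    resp = client.messages.create(model=model, max_tokens=200,
                                  messages=[{"role": "user", "content": PROMPT}])
    print(f"--- {model} ({time.time() - start:.1f}s, "
          f"{resp.usage.output_tokens} output tokens)\n{resp.content[0].text}\n")
```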
This is your first real build. Start with the skeleton in simple_app.py, then use Claude Code to extend it. The MBA folks should focus on making something useful — a tool you'd actually use at work. The MS folks should dig into the engineering: how do you parse structured outputs? How do you handle API errors? How does streaming work? Everyone should experiment with system prompts to shape the output.
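For the MS questions, here are sketches of all three patterns, assuming the Anthropic SDK; simple_app.py's actual structure may differ, so treat these as starting points rather than the reference implementation.

```python
# Sketches of the three engineering questions: structured outputs, API/parse
# errors, and streaming. Model names are placeholders.
import json

import anthropic

client = anthropic.Anthropic()

# 1. Structured output: ask for JSON via the system prompt, then validate
#    the reply rather than trusting it.
resp = client.messages.create(
    model="claude-sonnet-4-20250514", max_tokens=300,
    system='Reply with only a JSON object: {"summary": str, "action_items": [str]}',
    messages=[{"role": "user", "content": "Meeting notes: ..."}])
try:
    data = json.loads(resp.content[0].text)
except json.JSONDecodeError:
    data = None  # 2. Errors happen: fall back or retry instead of crashing

# 3. Streaming: print tokens as they arrive instead of waiting for the reply.
with client.messages.stream(
        model="claude-sonnet-4-20250514", max_tokens=300,
        messages=[{"role": "user", "content": "Draft a status update."}]) as stream:
    for chunk in stream.text_stream:
        print(chunk, end="", flush=True)
```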