Today we cross the threshold from LLMs as text generators to LLMs as agents that take actions in the world. This is the biggest conceptual shift in the course. By the end of today, you'll have built an AI agent that retrieves documents and uses tools to answer questions.
Think of this as a spectrum, not categories. A chatbot waits for questions and answers them. A copilot works alongside you, suggesting and completing. An agent acts on its own — it can search, compute, write files, call APIs. Higher autonomy means higher value but also higher risk. Most enterprise deployments today are chatbots evolving into copilots. Agents are the frontier, and that's what we're building today.
Every agent, no matter how complex, follows this loop. Perceive the current state — what's the user asking, what information do I have? Plan — what steps should I take? Act — execute a tool call, search, or computation. Observe — what did I get back? Then loop: am I done, or do I need more steps? Claude Code itself follows this pattern. When you ask it to fix a bug, it reads files (perceive), decides what to change (plan), edits code (act), and checks if it worked (observe).
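Here's a minimal runnable sketch of that loop. The `call_llm` and `run_tool` functions are hypothetical stand-ins (a real agent would call a model API and real tools); they're stubbed so the control flow is visible end to end.

```python
def call_llm(history):
    # Hypothetical model call: returns either a tool request or a final answer.
    # Stubbed with two fixed turns so the loop runs without an API key.
    if not any(h["role"] == "tool" for h in history):
        return {"type": "tool_call", "tool": "search", "args": {"q": "Q3 revenue"}}
    return {"type": "final", "text": "Revenue grew 12% in Q3."}

def run_tool(name, args):
    # Hypothetical tool executor.
    return f"[search results for {args['q']}]"

def agent_loop(question, max_steps=5):
    history = [{"role": "user", "content": question}]          # perceive
    for _ in range(max_steps):
        decision = call_llm(history)                           # plan
        if decision["type"] == "final":
            return decision["text"]                            # done
        result = run_tool(decision["tool"], decision["args"])  # act
        history.append({"role": "tool", "content": result})    # observe, then loop
    return "Stopped: hit max_steps without finishing."
```

The `max_steps` guard matters in practice: without it, an agent that never decides it's done will loop forever.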
Three main architectures. ReAct (Reasoning + Acting) interleaves thinking and tool use — it's the most common and what Claude Code uses. Plan-and-Execute creates a full plan upfront, then executes step by step — better for well-defined multi-step tasks. Reflexion adds a self-critique step: try, fail, reflect on why, try again — useful when the task is hard and first attempts often fail. In practice, most agents use ReAct or a hybrid.
RAG solves three fundamental LLM limitations. First: LLMs don't know your private data — your financials, policies, internal docs. Second: LLMs' knowledge has a cutoff date — they don't know what happened last week. Third: without sources, LLMs hallucinate confidently. RAG fixes all three by retrieving relevant documents and putting them in context before the model generates a response. The model can now cite sources, stay grounded in facts, and access up-to-date information.
RAG has two phases. Indexing happens offline, once: you take your documents, split them into chunks, convert each chunk to a vector embedding, and store those vectors in a database. Retrieval happens online, per query: you embed the user's question, search for the most similar chunks, and feed those chunks to the LLM along with the question. The LLM generates an answer grounded in the retrieved documents.
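Both phases fit in a few lines if we fake the hard parts. This sketch uses word overlap as a toy similarity score and a plain list as the "vector database"; a real pipeline would use a neural embedding model and a proper vector store, but the two-phase shape is the same.

```python
# --- Indexing (offline, once) ---
chunks = [
    "Revenue grew 12% in Q3.",
    "The office dog is named Biscuit.",
]
index = [(c, set(c.lower().split())) for c in chunks]  # toy "embedding": word set

# --- Retrieval (online, per query) ---
def retrieve(query, k=1):
    # Toy similarity: count of shared words. Real systems use vector
    # similarity between neural embeddings.
    q = set(query.lower().split())
    ranked = sorted(index, key=lambda cw: -len(q & cw[1]))
    return [c for c, _ in ranked[:k]]

context = retrieve("How much did revenue grow?")
# `context` plus the question would then go into the LLM's prompt.
```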
Chunking is where most RAG pipelines succeed or fail. Make chunks too small and you lose context: a chunk may not contain enough information to answer the question. Make them too large and you waste context window space: the relevant information gets diluted. The sweet spot is usually 500-1000 tokens with some overlap between chunks. But the best approach is document-aware chunking that respects the natural structure: split at section headers, keep tables together, respect paragraph boundaries.
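A minimal fixed-size chunker with overlap looks like this, using whitespace-split words as a stand-in for real tokenizer tokens. (Document-aware chunking would split on headers and paragraphs instead; this is the baseline.)

```python
def chunk(text, size=500, overlap=50):
    # Slide a window of `size` words forward by `size - overlap` each step,
    # so adjacent chunks share `overlap` words of context.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

pieces = chunk("one two three four five six seven eight", size=4, overlap=1)
# pieces[0] is "one two three four"; pieces[1] starts at "four",
# so each chunk repeats the last word of the previous one.
```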
Embeddings convert text to points in high-dimensional space. Similar meanings land near each other. When you search, you embed your query and find the closest document chunks. This is similarity search, and it's what makes RAG work. The quality of your embeddings directly determines the quality of your retrieval. Modern embedding models are remarkably good at capturing semantic similarity — "revenue growth" matches "sales increased" even though they share no words.
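The geometric idea reduces to cosine similarity between vectors. The vectors below are made up for illustration (real embeddings have hundreds or thousands of dimensions), but the comparison is the real operation.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query     = [0.9, 0.1, 0.3]   # imagine: embedding of "revenue growth"
match     = [0.8, 0.2, 0.4]   # imagine: "sales increased" -- nearby in space
unrelated = [0.1, 0.9, -0.5]  # imagine: "office dog policy" -- far away

# The semantically similar phrase scores higher despite sharing no words.
```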
RAG evaluation has two independent dimensions. Retrieval quality: are you finding the right documents? Precision, recall, and mean reciprocal rank are standard metrics. Generation quality: given the right documents, does the model answer correctly? Faithfulness (does it stick to the sources?), completeness (does it use all relevant info?), and hallucination rate (does it make stuff up?). A RAG system can fail at either stage — bad retrieval or bad generation — so you need to evaluate both.
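The retrieval-side metrics are simple to compute once you have labeled "gold" chunks for each test query. A sketch of two of them:

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved documents that are actually relevant.
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def mrr(ranked_lists, relevant_sets):
    # Mean reciprocal rank: 1 / (rank of first relevant hit), averaged
    # over queries. A query with no relevant hit contributes 0.
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(ranked_lists)

p = precision_at_k(["a", "b", "c"], {"a", "c"}, k=2)   # 1 of top 2 relevant: 0.5
m = mrr([["x", "a"], ["b"]], [{"a"}, {"b"}])           # (1/2 + 1/1) / 2 = 0.75
```

Generation-side metrics like faithfulness and hallucination rate usually need an LLM judge or human review rather than a formula.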
Students always ask: "Should I use RAG or fine-tuning?" The answer is usually both — they solve different problems. RAG gives the model access to knowledge. Fine-tuning changes the model's behavior and style. Long context (just paste everything in) is simplest but most expensive and only works for small document sets. In practice: fine-tune for behavior, RAG for knowledge, long context for quick prototypes.
Tool use is what turns an LLM from a text generator into an agent. The model doesn't actually call the API — it generates a structured request saying "I want to call this function with these arguments." Your code executes the function and returns the result. The model then uses that result to formulate its response. This is the same pattern behind Claude Code: when it edits a file, it's generating a tool call that the CLI executes.
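One full round trip of that pattern looks like this. The `model` function is a hypothetical stub standing in for a real LLM API; the key point is that your code, not the model, executes the tool and feeds back the result.

```python
import json

# Your tool registry: name -> callable. The lambda's data is made up.
TOOLS = {"get_stock_price": lambda ticker: {"ticker": ticker, "price": 187.32}}

def model(messages):
    # Stub model: first turn emits a structured tool request,
    # second turn formulates an answer from the tool result.
    last = messages[-1]
    if last["role"] == "user":
        return {"role": "assistant",
                "tool_call": {"name": "get_stock_price",
                              "arguments": {"ticker": "ACME"}}}
    price = json.loads(last["content"])["price"]
    return {"role": "assistant", "content": f"ACME is trading at ${price}."}

messages = [{"role": "user", "content": "What's ACME's stock price?"}]
reply = model(messages)
call = reply["tool_call"]
result = TOOLS[call["name"]](**call["arguments"])   # YOUR code runs the tool
messages.append({"role": "tool", "content": json.dumps(result)})
answer = model(messages)["content"]                 # model uses the result
```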
The tool definition is crucial. The model decides whether to use a tool based on the description — so write it like you're explaining it to a new coworker. Include when to use it, what it does, and what the parameters mean. A bad description leads to the model either never using the tool or using it at the wrong time. Think of tool definitions as part of your context engineering.
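Concretely, a tool definition is a schema plus that coworker-grade description. This example follows the JSON-schema convention used by major LLM APIs (the field names here match Anthropic's `input_schema` style; other providers use slightly different keys), and the tool itself is hypothetical.

```python
weather_tool = {
    "name": "get_weather",
    "description": (
        "Get the current weather for a city. Use this whenever the user "
        "asks about current conditions or temperature; do not answer "
        "weather questions from training data."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, e.g. 'Paris'.",
            },
            "units": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature units. Defaults to celsius.",
            },
        },
        "required": ["city"],
    },
}
```

Notice the description covers all three things: what the tool does, when to use it, and (via the parameter descriptions) what the arguments mean.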
Tools can be orchestrated in different patterns. Sequential: each tool uses the output of the previous one — search, then analyze, then summarize. Parallel: run multiple tools at once for speed — get stock price and news simultaneously. Conditional: the model decides which tool to use next based on what it learned — if the stock is down, search for news about why. Most real agents use a mix of these patterns. The model itself decides the orchestration based on the task.
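All three patterns show up in this small sketch, with hypothetical tool stubs and the orchestration hardcoded for clarity (in a real agent, the model would be making these decisions turn by turn):

```python
import asyncio

# Hypothetical tool stubs with made-up return values.
async def get_price(ticker):  return -2.5              # % change today
async def get_news(ticker):   return ["CFO resigns"]
async def summarize(facts):   return "; ".join(facts)

async def analyze(ticker):
    # Parallel: fetch price and news at the same time.
    price, news = await asyncio.gather(get_price(ticker), get_news(ticker))
    facts = [f"{ticker} moved {price}% today"]
    # Conditional: only dig into the news if the stock is down.
    if price < 0:
        facts.append(f"possible cause: {news[0]}")
    # Sequential: the summary consumes the earlier tools' outputs.
    return await summarize(facts)

report = asyncio.run(analyze("ACME"))
```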
Here's the reveal: Claude Code, the tool you've been using all semester, is itself an agent. It follows the exact perceive-plan-act-observe loop we just discussed. When you ask it to fix a bug, it reads files, plans changes, edits code, runs tests, and iterates. It uses tools: Read, Edit, Bash, Glob, Grep. It uses ReAct-style interleaved reasoning and action. Understanding how Claude Code works gives you a template for building your own agents.
This exercise walks you through building a complete RAG pipeline from scratch. You'll load documents, chunk them, embed them, store the vectors, retrieve relevant chunks for a query, and generate an answer. The key learning: how chunking strategy affects answer quality. Try different chunk sizes and see how the answers change. If your chunks are too small, the model lacks context. Too large, and retrieval precision drops.
Now you'll build a tool-using agent. The starter code gives you three tools: calculator, web search, and file reader. The model decides which tools to use and when. Your task is to add a new tool — maybe a database query, a date calculator, or a document summarizer — and test the agent on multi-step tasks that require multiple tool calls. Notice how the model chains tools together: search for data, calculate something, then write a summary.