Today we go deep on two things that separate amateur from expert use of LLMs: reasoning models and context engineering. If Week 1 was "how do these work" and Week 2 was "which one do I pick," this week is "how do I get the most out of them."
Let's be precise about what we mean. Most of what LLMs do is sophisticated pattern matching — recalling and recombining information from training. True reasoning — multi-step logic, planning, novel problem-solving — is much harder. The debate about whether LLMs truly "reason" or just simulate reasoning is ongoing, but practically speaking, we care about whether they get the right answer on complex tasks. That's what chain-of-thought and reasoning models improve.
This is from Wei et al. 2022, your reading for this week. The key finding: simply prompting the model to show its work dramatically improves accuracy on reasoning tasks. With a few worked examples that spell out intermediate steps, GSM8K math-benchmark accuracy jumped from roughly 18% to 57%. (The zero-shot trigger "Let's think step by step," from follow-up work by Kojima et al. 2022, captures the same effect without the examples.) Why does this work? By generating intermediate steps, the model can use its own output as working memory. Each step is a new pattern-matching opportunity. It's like giving yourself scratch paper on an exam.
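To make the mechanics concrete, here is a minimal sketch of the zero-shot variant — the function name and the example question are illustrative, not from any particular library:

```python
def build_prompt(question: str, chain_of_thought: bool = False) -> str:
    """Build a direct or a chain-of-thought version of the same question."""
    if chain_of_thought:
        # The zero-shot CoT trigger: ask for intermediate steps before the answer.
        return f"{question}\n\nLet's think step by step."
    return question

q = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
     "than the ball. How much does the ball cost?")
print(build_prompt(q, chain_of_thought=True))
```

Same question, one extra line — the difference is entirely in what the model is invited to generate before committing to an answer.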
CoT isn't universally helpful — it adds latency and cost. Use it when the task requires multiple steps of reasoning. Don't use it for simple retrieval or classification tasks. The scratch paper rule is a good heuristic: if a smart human would need to write things down to get the answer right, the model probably benefits from CoT too.
There are two ways to get reasoning out of a model. Prompt-time CoT: you tell the model to think step by step. It works with any model and costs nothing extra, but it only goes so far. Train-time reasoning: models like o1, DeepSeek-R1, and Claude's extended thinking are specifically trained to reason. They generate a chain of thought internally before producing an answer. These models are significantly better on hard problems — math, coding, complex analysis — but they cost more because they generate many more tokens internally.
Reasoning models insert a thinking phase between your prompt and the response. The model generates potentially thousands of tokens of internal deliberation — exploring approaches, catching its own mistakes, reconsidering. With Claude's extended thinking, you can actually see this reasoning. With o1, it's hidden. The cost implication: reasoning models can use 10-100x more tokens than standard models for the same query, but the quality improvement on hard tasks is dramatic.
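You can estimate the cost multiplier with simple arithmetic. A sketch — the per-million-token prices and token counts below are placeholder numbers for illustration, not real rates:

```python
def query_cost(input_tokens: int, output_tokens: int,
               price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Dollar cost of one query, given per-million-token prices."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

# Same 500-token prompt; the reasoning model emits ~30x more output tokens
# (thinking + answer). Prices are made up for the sake of the comparison.
standard  = query_cost(500,    400, price_in_per_mtok=3.0, price_out_per_mtok=15.0)
reasoning = query_cost(500, 12_000, price_in_per_mtok=3.0, price_out_per_mtok=15.0)
print(f"standard ~${standard:.4f}, reasoning ~${reasoning:.4f}, "
      f"ratio ~{reasoning / standard:.0f}x")
```

Because output tokens dominate, the thinking phase — not the prompt — is what drives the bill.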
Reasoning models are expensive — both in cost and latency. Use them judiciously. Hard task + high accuracy requirement = reasoning model. Everything else = standard model with good prompting. The most common mistake is using a reasoning model for tasks that don't need it. Summarization, translation, simple Q&A — these are pattern matching tasks where standard models excel.
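The routing rule above can be written down as a two-line heuristic. The model names are generic placeholders, and real routing would weigh more signals than two booleans — treat this as the shape of the decision, not a finished policy:

```python
def pick_model(multi_step: bool, high_stakes: bool) -> str:
    """Crude router: reserve the reasoning model for hard, accuracy-critical
    tasks; default everything else to a standard model with good prompting."""
    if multi_step and high_stakes:
        return "reasoning-model"
    return "standard-model"

assert pick_model(multi_step=True,  high_stakes=True)  == "reasoning-model"
assert pick_model(multi_step=False, high_stakes=True)  == "standard-model"  # e.g. translation
assert pick_model(multi_step=True,  high_stakes=False) == "standard-model"  # CoT prompting instead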
Context engineering is the single most important skill for using LLMs effectively. Feed the same model a thin context and a rich, well-structured one, and you get radically different results: the model is identical; the difference is entirely in what context you provide. This is why "prompt engineering" as a discipline exists — but I prefer "context engineering" because it's broader. It's not just the prompt text, it's everything you put in the context window: system prompt, examples, retrieved documents, conversation history, tool results.
The context window is the model's entire working memory. Everything the model knows about your task must fit in this window. Claude's context window is 200K tokens — about 150,000 words or 500 pages. That sounds huge, but in practice you need to be strategic about what goes in. Every token you add costs money and can dilute the signal. The skill is fitting maximum signal into minimum tokens.
Think of context engineering as a stack. At the bottom: the system prompt defines who the model is and how it should behave. Above that: few-shot examples show (not tell) the desired behavior. Then retrieved context from a knowledge base. Then conversation history for multi-turn interactions. And tool results for real-time data. Each layer gives you more control. A well-engineered context stack can make a standard model outperform a reasoning model with a bad prompt.
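The stack can be sketched as a single assembly function. The message shape below is illustrative — real APIs typically take the system prompt as a separate parameter rather than a `"system"` message, and the field names here are assumptions:

```python
def build_context(system_prompt, examples, retrieved_docs, history,
                  tool_results, query):
    """Assemble the context-stack layers, bottom to top, into one message list."""
    messages = [{"role": "system", "content": system_prompt}]
    for ex in examples:  # few-shot pairs rendered as prior turns
        messages.append({"role": "user", "content": ex["input"]})
        messages.append({"role": "assistant", "content": ex["output"]})
    messages.extend(history)  # earlier turns of this conversation
    # Retrieved documents and tool results ride along with the final query.
    context_block = "\n\n".join(retrieved_docs + tool_results)
    messages.append({"role": "user", "content": f"{context_block}\n\n{query}"})
    return messages

messages = build_context(
    system_prompt="You are a support triage assistant.",
    examples=[{"input": "App crashes on launch.", "output": "BUG"}],
    retrieved_docs=["KB-42: known crash on v2.1"],
    history=[],
    tool_results=[],
    query="My app crashes when I open it.",
)
```

The point of writing it this way: each layer is an explicit, swappable input, so you can test the effect of adding or removing a layer one at a time.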
The system prompt is your most important lever. It should include four things: Role (who the model is), Task (what it's doing), Constraints (what it must NOT do), and Format (how to structure output). The difference between "helpful assistant" and a well-crafted system prompt is the difference between a general intern and a domain expert. Invest time here — it's the highest-ROI activity in any GenAI project.
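Here is the four-part structure laid out as code. The domain, labels, and wording are invented for illustration — the point is that each part is written and reviewed separately, then joined:

```python
ROLE = "You are a senior claims analyst at a health-insurance company."
TASK = "Classify each incoming claim note as APPROVE, DENY, or NEEDS_REVIEW."
CONSTRAINTS = ("Never invent policy numbers. "
               "If the note lacks a diagnosis code, answer NEEDS_REVIEW.")
FORMAT = "Respond with exactly one label on a single line, no explanation."

# Role, Task, Constraints, Format -- assembled in that order.
system_prompt = "\n\n".join([ROLE, TASK, CONSTRAINTS, FORMAT])
```

Keeping the four parts as named pieces also makes iteration cheaper: when a test case fails, you usually know which part to tighten.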
Few-shot examples are the most reliable way to control model behavior. Instead of describing what you want, you show it. The model picks up on the pattern — format, tone, reasoning style, edge cases. Three to five examples usually suffice. Include at least one edge case or tricky example. The examples don't just teach the model what to output — they implicitly communicate your standards and expectations.
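A small sketch of what that looks like in practice — the ticket-triage task and labels are invented, and note that the third example deliberately demonstrates how to resolve a tricky mixed-intent case:

```python
EXAMPLES = [
    {"input": "Refund the duplicate charge on order #1182.", "label": "BILLING"},
    {"input": "App crashes when I rotate my phone.", "label": "BUG"},
    # Edge case: mixed intent -- pick a convention and demonstrate it.
    {"input": "The crash wiped my cart and I was double-billed.", "label": "BILLING"},
]

def few_shot_block(examples):
    """Render examples as input/label pairs the model can pattern-match on."""
    return "\n\n".join(f"Input: {e['input']}\nLabel: {e['label']}"
                       for e in examples)

print(few_shot_block(EXAMPLES))
```

The edge case does double duty: it teaches the model your tie-breaking rule without you ever having to state it in prose.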
These five patterns are your core toolkit. Persona primes the model with relevant knowledge. CoT improves reasoning. Self-critique catches errors. Output format ensures usable responses. Constraints prevent common failure modes like hallucination. You can combine them — a strong prompt often uses all five.
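A sketch of all five patterns combined in one prompt — the code-review task and every line of wording here are illustrative, not a canonical template:

```python
PATTERNS = {
    "persona":       "You are a senior Python code reviewer.",
    "cot":           "Reason through the diff step by step before judging.",
    "self_critique": ("Before answering, re-check each finding and drop any "
                      "you cannot tie to a specific line."),
    "constraints":   ("Cite only lines present in the diff; say 'unsure' "
                      "rather than guess."),
    "format":        'Return JSON: {"issues": [{"line": <int>, "note": <str>}]}',
}

# One pattern per named piece, joined into a single system prompt.
review_prompt = "\n\n".join(PATTERNS.values())
```

Naming each pattern makes ablation easy: drop one key, rerun your test cases, and measure what it was buying you.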
Common mistakes. Vague instructions get vague outputs. Contradictory instructions confuse the model — it'll randomly prioritize one over the other. Overloaded prompts lead to partial completion. Negative-only instructions ("don't do X") are less effective than positive instructions ("do Y instead"). When in doubt: be specific, be consistent, and break complex tasks into separate prompts.
This is the "lost in the middle" phenomenon. When you stuff a lot of text into the context window, the model pays most attention to the beginning (system prompt) and end (most recent message). Information in the middle gets less attention. This has practical implications: put your most important instructions at the beginning and end. If you're doing RAG, put the most relevant documents near the end, close to the query. Don't rely on the model noticing a crucial instruction buried in page 50 of your context.
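For RAG, that advice turns into a one-line reordering step. A sketch, assuming your retriever already returns (document, relevance-score) pairs:

```python
def order_for_context(scored_docs):
    """Sort retrieved docs so the MOST relevant come last, landing nearest
    the query in the high-attention region at the end of the context."""
    return [doc for doc, score in sorted(scored_docs, key=lambda p: p[1])]

docs = [("doc_a", 0.91), ("doc_b", 0.42), ("doc_c", 0.77)]
print(order_for_context(docs))  # least relevant first, doc_a ends up last
```

Counterintuitive at first — you are deliberately burying your weakest documents in the low-attention middle, where it matters least if they get ignored.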
Prompt engineering is not a one-shot activity. It's iterative, just like software development. Draft a prompt, test it on diverse cases, evaluate the outputs, fix failures, repeat. The most common mistake is testing on one or two examples and declaring victory. You need at least 10-20 test cases covering normal inputs, edge cases, and adversarial inputs. This is exactly what we'll practice in the hands-on exercise.
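The draft-test-fix loop is worth automating from day one. A minimal harness sketch — `toy_model` stands in for a real model call, and the suite is deliberately tiny; yours should have the 10-20 cases described above:

```python
def run_suite(model_fn, test_cases):
    """Run the prompt (wrapped in model_fn) over a labeled suite; return failures."""
    failures = [(inp, expected, got)
                for inp, expected in test_cases
                if (got := model_fn(inp)) != expected]
    print(f"{len(test_cases) - len(failures)}/{len(test_cases)} passed")
    return failures

# Stand-in for a real model call, so the loop's shape is clear.
def toy_model(text):
    return "BILLING" if "charge" in text else "BUG"

suite = [
    ("Refund the duplicate charge.", "BILLING"),
    ("Crash on rotate.", "BUG"),
    ("Double-billed after the crash.", "BILLING"),  # edge case
]
failures = run_suite(toy_model, suite)  # the edge case fails -- now iterate
```

Notice the harness immediately surfaces the edge-case failure that spot-checking two happy-path examples would have missed — which is exactly the trap the paragraph above warns about.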
You'll run the same hard problem through three approaches: direct prompting, CoT prompting, and extended thinking. Compare the answers, the number of tokens used, and the cost. The goal is to build intuition for when reasoning models earn their keep. Try at least three different problem types — you'll see that the benefit varies dramatically by task type.
This is the core exercise. You'll use the scaffold in context_engineer.py to systematically build and test a prompt. Start with the persona, add instructions, then examples, then constraints. Test it, find where it fails, fix it. Use Claude Code to help you — it's meta: using an AI to help you write better prompts for AI. Spend at least 20 minutes iterating. The first version of your prompt will not be good enough.