Chain-of-Thought: How AI Learned to Show Its Work
A deceptively simple insight: if you ask a model to 'think step by step,' it reasons better. Chain-of-Thought prompting showed that intermediate reasoning steps—not just final answers—unlock a model's latent reasoning ability.
The Problem: Models Know But Don't Show
By 2021, large language models were impressively capable. GPT-3 could answer trivia questions, solve word problems, and reason about concepts. But there was a catch: when models got the answer wrong, it was often not because they lacked knowledge—it was because they skipped the reasoning.
Compare these two scenarios:
Scenario 1 (Direct answer):
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the lot now?
A: 6 cars
Wrong. The model jumped to an answer without showing work.
Scenario 2 (Step-by-step):
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the lot now?
A: Let me think step by step.
- Start with 3 cars in the lot
- 2 more cars arrive
- 3 + 2 = 5
- There are 5 cars in the lot now.
Answer: 5 cars
Correct. By walking through the reasoning, the model got it right.
The insight was counterintuitive: the model could perform the reasoning, but only if you asked it to show its work.
The Core Idea: Prompting for Intermediate Steps
Chain-of-Thought (CoT) prompting is remarkably simple: when you ask a model a question, include an example of step-by-step reasoning in your prompt.
Instead of asking:
Q: If Alice has 3 apples and Bob gives her 2 more, how many does she have?
A:
You ask with a reasoning example:
Q: Sally has 5 stones. She finds 2 more stones, then loses 1. How many stones does she have?
A: Sally starts with 5 stones. She finds 2 more, so 5 + 2 = 7. Then she loses 1, so 7 - 1 = 6. Sally has 6 stones.
Q: If Alice has 3 apples and Bob gives her 2 more, how many does she have?
A: Alice starts with 3 apples. Bob gives her 2 more, so 3 + 2 = 5. Alice has 5 apples.
By showing a worked example with intermediate reasoning steps, the model learns to output similar reasoning for new problems. This is called "few-shot prompting with chain-of-thought."
The magic: you're not changing the model. You're just changing how you ask it questions.
Why This Works
Decomposing Complex Reasoning
Complex reasoning tasks require multiple steps. Without explicitly prompting for steps, models might:
- Skip intermediate reasoning
- Jump to incorrect conclusions
- Fail on tasks that require composing simpler operations
Asking for step-by-step reasoning forces the model to decompose the problem:
- Parse the problem
- Identify sub-goals
- Solve each sub-goal in order
- Combine results into a final answer
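The four steps above can be sketched in code for the parking-lot example. This is a toy illustration of the decomposition CoT encourages, not a model of what happens inside the network:

```python
def solve_parking_lot(start, arriving):
    """Toy decomposition of the parking-lot problem into explicit steps."""
    steps = []
    # 1. Parse the problem: identify the known quantities.
    steps.append(f"Start with {start} cars in the lot")
    # 2. Identify the sub-goal: account for the new arrivals.
    steps.append(f"{arriving} more cars arrive")
    # 3. Solve the sub-goal: add the two quantities.
    total = start + arriving
    steps.append(f"{start} + {arriving} = {total}")
    # 4. Combine into a final answer.
    steps.append(f"Answer: {total} cars")
    return steps

for step in solve_parking_lot(3, 2):
    print("-", step)
```

Each intermediate line mirrors one step the model is nudged to articulate before committing to an answer.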
Emergent Ability
Here's what's remarkable: the model doesn't need explicit training on step-by-step reasoning. The ability is already latent in the weights from pretraining.
When you show examples of reasoning, the prompt context triggers this latent ability. It's an "emergent" behavior: something the model picked up during pretraining without ever being explicitly trained to produce step-by-step reasoning.
In-Context Learning
Chain-of-Thought is an extreme example of in-context learning: the model learns how to behave from the context (the prompt itself) rather than from task-specific training.
This is different from fine-tuning:
- Fine-tuning: Update the model's weights based on labeled examples
- In-context learning: Show examples in the prompt; the model adapts at inference time
The Technical Picture
Few-Shot vs. Zero-Shot
Zero-shot CoT (simplest):
Prompt: "Q: If there are 3 apples...\nA: Let's think step by step."
Response: [Model generates step-by-step reasoning]
Just appending "Let's think step by step" after the answer cue can improve reasoning, even without worked examples.
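A minimal sketch of building the zero-shot prompt. Following the zero-shot CoT formulation (Kojima et al., 2022), the trigger phrase is appended after the answer cue:

```python
def zero_shot_cot_prompt(question):
    """Zero-shot CoT: no worked examples, just the trigger phrase
    appended after the answer cue."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = zero_shot_cot_prompt(
    "If there are 3 apples and I eat 1, how many remain?"
)
print(prompt)
```

The model continues from the trigger phrase, generating its reasoning before the final answer.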
Few-shot CoT:
Prompt:
Example 1: "Q: ...
A: Let me work through this step by step.
Step 1: ...
Step 2: ...
Answer: ..."
New question: "Q: ..."
Response: [Model follows the pattern]
Multiple worked examples improve performance further.
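Assembling a few-shot CoT prompt from several worked examples can be sketched as a small helper (the function name and pair format are illustrative, not from the paper):

```python
def few_shot_cot_prompt(examples, question):
    """Assemble a few-shot CoT prompt from (question, worked_answer)
    pairs, ending with the new question for the model to complete."""
    parts = []
    for q, worked_answer in examples:
        parts.append(f"Q: {q}\nA: {worked_answer}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

examples = [
    ("Sally has 5 stones and finds 2 more. How many does she have?",
     "Sally starts with 5 stones. She finds 2 more, so 5 + 2 = 7. "
     "Answer: 7 stones."),
]
print(few_shot_cot_prompt(examples, "Alice has 3 apples and gets 2 more. How many?"))
```

Adding more pairs to `examples` is how you scale from one-shot to few-shot.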
What Gets Generated
When prompted to reason step-by-step, models produce:
- Natural language intermediate steps
- Explicit problem decomposition
- Intermediate numerical calculations
- Sometimes self-corrections or sanity checks
These outputs are not just "filler": they represent actual reasoning. Ablation studies show that when the reasoning steps are removed or replaced with irrelevant text, accuracy drops back toward the no-CoT baseline.
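One practical consequence: the final answer has to be extracted from the generated reasoning. A simple regex-based extractor, a common heuristic rather than anything from the original paper, might look like this:

```python
import re

def extract_final_answer(cot_output):
    """Pull the value after an 'Answer:' line, a common convention
    for CoT outputs. Returns None if no such line exists."""
    match = re.search(r"Answer:\s*(.+)", cot_output)
    return match.group(1).strip() if match else None

reasoning = (
    "Start with 3 cars in the lot.\n"
    "2 more cars arrive, so 3 + 2 = 5.\n"
    "Answer: 5 cars"
)
print(extract_final_answer(reasoning))  # → 5 cars
```

In practice, evaluation harnesses for CoT rely on extractors like this to score the final answer while ignoring the intermediate text.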
Why It's Not Just "More Tokens"
An important clarification: CoT works because of the reasoning structure, not just because it generates more text.
You can't just append random text to a prompt and improve performance. The content matters:
- Relevant reasoning steps → helps
- Irrelevant rambling → hurts or doesn't help
- False intermediate steps → hurts
Results: The Numbers
The original paper tested Chain-of-Thought on three categories of tasks.
Arithmetic Reasoning
| Task | No CoT | With CoT | Improvement |
|---|---|---|---|
| Multi-step arithmetic (MultiArith) | 78% | 98% | +20 pp |
| Grade school math problems (GSM8K) | 49% | 74% | +25 pp |
| SVAMP math problems | 79% | 92% | +13 pp |
Dramatic improvements, especially on harder problems.
Commonsense Reasoning
| Task | No CoT | With CoT | Improvement |
|---|---|---|---|
| StrategyQA | 66% | 79% | +13 pp |
| Date understanding | 34% | 80% | +46 pp |
| Tracking shuffled objects | 37% | 88% | +51 pp |
Some improvements are massive (tracking shuffled objects).
Symbolic Reasoning
| Task | No CoT | With CoT | Improvement |
|---|---|---|---|
| Last-letter concatenation | 2% | 79% | +77 pp |
| Coin flip reasoning | 4% | 79% | +75 pp |
These tasks require careful step-by-step reasoning. Without CoT, models fail catastrophically.
Key Finding: Model Size Matters
Interestingly, CoT only helps larger models.
- Small models (GPT-3 350M) showed minimal improvement
- Medium models (GPT-3 6.7B) showed modest improvements
- Large models (GPT-3 175B) showed massive improvements
This suggested that reasoning ability exists in the model's weights but becomes accessible only at scale.
Why This Matters
Unlocking Latent Reasoning
Before CoT, it seemed like models had fundamental limits on reasoning tasks. The paper showed that the limits were in the prompting strategy, not the model itself.
A 175B model that appears to struggle with math suddenly performs much better when asked to show work. The reasoning ability was always there.
Simple Yet Powerful
CoT is:
- Free to use — no fine-tuning required
- Model-agnostic — works with any language model
- Easy to implement — just add "let's think step by step"
- Broadly applicable — works on arithmetic, commonsense, symbolic reasoning, and beyond
This simplicity is deceptive. A one-sentence change to your prompt can cause large improvements.
Practical Applications
In practice, Chain-of-Thought showed that:
- Better prompting ≈ Better reasoning — you don't always need a bigger model; you need better prompting
- Intermediate steps matter — forcing articulation of reasoning improves accuracy
- Scaling isn't just about size — prompting strategy is a form of "scaling" capability
- Humans and AI align better with explanations — when models show reasoning, we can verify their logic
The Bigger Picture
From Prediction to Reasoning
The progression of the series so far:
- Pretraining (GPT-2/3) → Language modeling
- Instruction tuning (FLAN) → Task understanding
- Alignment (InstructGPT) → Human preferences
- Chain-of-Thought → Reasoning capability
CoT answers a key question: "The model follows instructions and aligns with human preferences. But can it think?"
The answer: "It could think all along. We just needed to ask it to show its work."
Limitations
CoT is powerful but not magic:
- Hallucination: Step-by-step reasoning doesn't eliminate made-up facts. A model can reason perfectly but reason from false premises.
- Domain limits: Works best on tasks where step-by-step reasoning exists (math, logic, planning). Less effective for tasks requiring broad knowledge.
- Computational cost: CoT generates more tokens, increasing inference latency and cost.
- Small models: CoT doesn't help smaller models much, suggesting reasoning is a capability that emerges at scale.
Follow-Up Work
Chain-of-Thought opened an entire research direction. Subsequent papers built on it:
- Self-Consistency (Wang et al., 2022) — Generate multiple reasoning chains and take a majority vote over their final answers
- Tree-of-Thoughts (Yao et al., 2023) — Explore multiple reasoning paths as a search tree, not just one linear chain
- Retrieval-Augmented Generation (RAG) — Combine reasoning with external knowledge retrieval (the retrieval idea itself predates CoT)
- Faithful Reasoning — Making sure intermediate steps actually drive the final answer
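Of these follow-ups, self-consistency is the easiest to sketch: sample several reasoning chains for the same question, extract each chain's final answer, and return the majority vote. The `sample_chain` callable below is a stand-in for sampling from a real model:

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_chain, question, n_samples=5):
    """Self-consistency: sample several reasoning chains for the same
    question and majority-vote over their final answers."""
    answers = [sample_chain(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for a sampled model: four chains say "5", one slips to "6".
canned = cycle(["5", "6", "5", "5", "5"])
fake_model = lambda question: next(canned)
print(self_consistency(fake_model, "3 cars + 2 more cars?"))  # → 5
```

The vote smooths over occasional faulty chains, which is why it improves on a single greedy chain.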
Code: How It Works
Conceptually, CoT changes the prompt structure:
```python
def standard_prompt(question):
    """Without Chain-of-Thought."""
    return f"Q: {question}\nA:"

def cot_prompt(question, example_q, example_a_with_steps):
    """With Chain-of-Thought (few-shot): one worked example,
    then the new question."""
    return (
        f"Q: {example_q}\n"
        f"A: {example_a_with_steps}\n"
        f"Q: {question}\n"
        f"A: Let me work through this step by step.\n"
    )

# Example usage:
example_q = "If Alice has 5 apples and Bob gives her 3, how many does she have?"
example_a = """Step 1: Alice starts with 5 apples.
Step 2: Bob gives her 3 apples.
Step 3: 5 + 3 = 8.
Answer: 8 apples."""

question = "If there are 3 birds on a tree and 2 fly away, how many remain?"

# Standard approach: the model may answer directly and skip the reasoning.
# CoT approach: the model shows its work, which increases accuracy.
print(cot_prompt(question, example_q, example_a))
```
The model is the same. The prompt structure is different.
Why This Became Important
When ChatGPT launched in November 2022, users quickly discovered that asking it to "think step by step" improved answers on reasoning tasks. This was Chain-of-Thought in action.
By March 2023 (when GPT-4 launched), OpenAI and others were using CoT-style reasoning as a standard feature. Every modern LLM assistant uses some form of prompted reasoning.
The paper's simplicity and effectiveness made it one of the most influential prompting techniques in modern NLP.
Series Navigation
- Attention Is All You Need: The Paper That Changed AI
- BERT: How AI Learned to Truly Read
- GPT-2: How AI Learned to Write
- FLAN: How AI Learned to Follow Instructions
- InstructGPT: How AI Learned What Humans Actually Want
- Chain-of-Thought: How AI Learned to Show Its Work ← You are here
Last Updated: March 30, 2026
Author: RESEARCHER
Category: Research / Tutorial
Difficulty: Beginner-friendly
Prerequisite: Basic understanding of prompting and language models
Paper: Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (arXiv:2201.11903, January 2022)