Chain-of-Thought: How AI Learned to Show Its Work
A deceptively simple insight: if you ask a model to 'think step by step,' it reasons better. Chain-of-Thought prompting showed that intermediate reasoning steps—not just final answers—unlock a model's latent reasoning ability.
The Problem: Models Know But Don't Show
By 2021, large language models were impressively capable. GPT-3 could answer trivia questions, solve word problems, and reason about concepts. But there was a catch: when models got the answer wrong, it was often not because they lacked knowledge—it was because they skipped the reasoning.
Compare these two scenarios:
Scenario 1 (Direct answer):
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the lot now?
A: 6 cars
Wrong. The model jumped to an answer without showing work.
Scenario 2 (Step-by-step):
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the lot now?
A: Let me think step by step.
- Start with 3 cars in the lot
- 2 more cars arrive
- 3 + 2 = 5
- There are 5 cars in the lot now.
Answer: 5 cars
Correct. By walking through the reasoning, the model got it right.
The insight was counterintuitive: the model could perform the reasoning, but only if you asked it to show its work.
The Core Idea: Prompting for Intermediate Steps
Chain-of-Thought (CoT) prompting is remarkably simple: when you ask a model a question, include an example of step-by-step reasoning in your prompt.
Instead of asking:
Q: If Alice has 3 apples and Bob gives her 2 more, how many does she have?
A:
You ask with a reasoning example:
Q: Sally has 5 stones. She finds 2 more stones, then loses 1. How many stones does she have?
A: Sally starts with 5 stones. She finds 2 more, so 5 + 2 = 7. Then she loses 1, so 7 - 1 = 6. Sally has 6 stones.
Q: If Alice has 3 apples and Bob gives her 2 more, how many does she have?
A: Alice starts with 3 apples. Bob gives her 2 more, so 3 + 2 = 5. Alice has 5 apples.
By showing a worked example with intermediate reasoning steps, the model learns to output similar reasoning for new problems. This is called "few-shot prompting with chain-of-thought."
The magic: you're not changing the model. You're just changing how you ask it questions.
Why This Works
Decomposing Complex Reasoning
Complex reasoning tasks require multiple steps. Without explicitly prompting for steps, models might:
- Skip intermediate reasoning
- Jump to incorrect conclusions
- Fail on tasks that require composing simpler operations
Asking for step-by-step reasoning forces the model to decompose the problem:
- Parse the problem
- Identify sub-goals
- Solve each sub-goal in order
- Combine results into a final answer
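The four steps above can be sketched in code for the parking-lot example. This is a toy illustration of the decomposition CoT encourages, not a model of what happens inside the network:

```python
def solve_parking_lot(start, arriving):
    """Toy decomposition of the parking-lot problem into explicit steps."""
    steps = []
    # 1. Parse the problem: identify the known quantities.
    steps.append(f"Start with {start} cars in the lot")
    # 2. Identify the sub-goal: account for the new arrivals.
    steps.append(f"{arriving} more cars arrive")
    # 3. Solve the sub-goal: add the two quantities.
    total = start + arriving
    steps.append(f"{start} + {arriving} = {total}")
    # 4. Combine into a final answer.
    steps.append(f"Answer: {total} cars")
    return steps

for step in solve_parking_lot(3, 2):
    print("-", step)
```

Each intermediate line mirrors one step the model is nudged to articulate before committing to an answer.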
Emergent Ability
Here's what's remarkable: the model doesn't need explicit training on step-by-step reasoning. The ability is already latent in the weights from pretraining.
When you show examples of reasoning, the prompt context triggers this latent ability. It's an "emergent" behavior: something the model picked up during pretraining without ever being explicitly trained to produce step-by-step reasoning.
In-Context Learning
Chain-of-Thought is an extreme example of in-context learning: the model learns how to behave from the context (the prompt itself) rather than from task-specific training.
This is different from fine-tuning:
- Fine-tuning: Update the model's weights based on labeled examples
- In-context learning: Show examples in the prompt; the model adapts at inference time
The Technical Picture
Few-Shot vs. Zero-Shot
Zero-shot CoT (simplest):
Prompt: "Q: If there are 3 apples...\nA: Let's think step by step."
Response: [Model generates step-by-step reasoning]
Just appending "Let's think step by step" after the answer cue can improve reasoning, even without worked examples.
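A minimal sketch of building the zero-shot prompt. Following the zero-shot CoT formulation (Kojima et al., 2022), the trigger phrase is appended after the answer cue:

```python
def zero_shot_cot_prompt(question):
    """Zero-shot CoT: no worked examples, just the trigger phrase
    appended after the answer cue."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = zero_shot_cot_prompt(
    "If there are 3 apples and I eat 1, how many remain?"
)
print(prompt)
```

The model continues from the trigger phrase, generating its reasoning before the final answer.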
Few-shot CoT:
Prompt:
Example 1: "Q: ...
A: Let me work through this step by step.
Step 1: ...
Step 2: ...
Answer: ..."
New question: "Q: ..."
Response: [Model follows the pattern]
Multiple worked examples improve performance further.
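Assembling a few-shot CoT prompt from several worked examples can be sketched as a small helper (the function name and pair format are illustrative, not from the paper):

```python
def few_shot_cot_prompt(examples, question):
    """Assemble a few-shot CoT prompt from (question, worked_answer)
    pairs, ending with the new question for the model to complete."""
    parts = []
    for q, worked_answer in examples:
        parts.append(f"Q: {q}\nA: {worked_answer}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

examples = [
    ("Sally has 5 stones and finds 2 more. How many does she have?",
     "Sally starts with 5 stones. She finds 2 more, so 5 + 2 = 7. "
     "Answer: 7 stones."),
]
print(few_shot_cot_prompt(examples, "Alice has 3 apples and gets 2 more. How many?"))
```

Adding more pairs to `examples` is how you scale from one-shot to few-shot.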
What Gets Generated
When prompted to reason step-by-step, models produce:
- Natural language intermediate steps
- Explicit problem decomposition
- Intermediate numerical calculations
- Sometimes self-corrections or sanity checks
These outputs are not just "filler": they represent actual reasoning. Ablation studies show that when the reasoning steps are removed or replaced with irrelevant text, accuracy drops back toward the no-CoT baseline.
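One practical consequence: the final answer has to be extracted from the generated reasoning. A simple regex-based extractor, a common heuristic rather than anything from the original paper, might look like this:

```python
import re

def extract_final_answer(cot_output):
    """Pull the value after an 'Answer:' line, a common convention
    for CoT outputs. Returns None if no such line exists."""
    match = re.search(r"Answer:\s*(.+)", cot_output)
    return match.group(1).strip() if match else None

reasoning = (
    "Start with 3 cars in the lot.\n"
    "2 more cars arrive, so 3 + 2 = 5.\n"
    "Answer: 5 cars"
)
print(extract_final_answer(reasoning))  # → 5 cars
```

In practice, evaluation harnesses for CoT rely on extractors like this to score the final answer while ignoring the intermediate text.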
Why It's Not Just "More Tokens"
An important clarification: CoT works because of the reasoning structure, not just because it generates more text.
You can't just append random text to a prompt and improve performance. The content matters:
- Relevant reasoning steps → helps
- Irrelevant rambling → hurts or doesn't help
- False intermediate steps → hurts
Results: The Numbers
The original paper tested Chain-of-Thought on three categories of tasks.
Arithmetic Reasoning
| Task | No CoT | With CoT | Improvement |
|---|---|---|---|
| Multi-step arithmetic (MultiArith) | 78% | 98% | +20 pp |
| Grade school math problems (GSM8K) | 49% | 74% | +25 pp |
| SVAMP math problems | 79% | 92% | +13 pp |
Dramatic improvements, especially on harder problems.
Commonsense Reasoning
| Task | No CoT | With CoT | Improvement |
|---|---|---|---|
| StrategyQA | 66% | 79% | +13 pp |
| Date understanding | 34% | 80% | +46 pp |
| Tracking shuffled objects | 37% | 88% | +51 pp |
Some improvements are massive (tracking shuffled objects).
Symbolic Reasoning
| Task | No CoT | With CoT | Improvement |
|---|---|---|---|
| Last-letter concatenation | 2% | 79% | +77 pp |
| Coin flip reasoning | 4% | 79% | +75 pp |
These tasks require careful step-by-step reasoning. Without CoT, models fail catastrophically.
Key Finding: Model Size Matters
Interestingly, CoT only helps larger models.
- Small models (GPT-3 350M) showed minimal improvement
- Medium models (GPT-3 6.7B) showed modest improvements
- Large models (GPT-3 175B) showed massive improvements
This suggested that reasoning ability exists in the model's weights but becomes accessible only at scale.
Why This Matters
Unlocking Latent Reasoning
Before CoT, it seemed like models had fundamental limits on reasoning tasks. The paper showed that the limits were in the prompting strategy, not the model itself.
A 175B model that appears to struggle with math suddenly performs much better when asked to show work. The reasoning ability was always there.
Simple Yet Powerful
CoT is:
- Free to use — no fine-tuning required
- Model-agnostic — works with any language model
- Easy to implement — just add "let's think step by step"
- Broadly applicable — works on arithmetic, commonsense, symbolic reasoning, and beyond
This simplicity is deceptive. A one-sentence change to your prompt can cause large improvements.
Practical Applications
In practice, Chain-of-Thought showed that:
- Better prompting ≈ Better reasoning — you don't always need a bigger model; you need better prompting
- Intermediate steps matter — forcing articulation of reasoning improves accuracy
- Scaling isn't just about size — prompting strategy is a form of "scaling" capability
- Humans and AI align better with explanations — when models show reasoning, we can verify their logic
The Bigger Picture
From Prediction to Reasoning
The progression of the series so far:
- Pretraining (GPT-2/3) → Language modeling
- Instruction tuning (FLAN) → Task understanding
- Alignment (InstructGPT) → Human preferences
- Chain-of-Thought → Reasoning capability
CoT answers a key question: "The model follows instructions and aligns with human preferences. But can it think?"
The answer: "It could think all along. We just needed to ask it to show its work."
Limitations
CoT is powerful but not magic:
- Hallucination: Step-by-step reasoning doesn't eliminate made-up facts. A model can reason perfectly but reason from false premises.
- Domain limits: Works best on tasks where step-by-step reasoning exists (math, logic, planning). Less effective for tasks requiring broad knowledge.
- Computational cost: CoT generates more tokens, increasing inference latency and cost.
- Small models: CoT doesn't help smaller models much, suggesting reasoning is a capability that emerges at scale.
Follow-Up Work
Chain-of-Thought opened an entire research direction. Subsequent papers built on it:
- Self-Consistency (Wang et al., 2022) — Generate multiple reasoning chains and take a majority vote over their final answers
- Tree-of-Thoughts (Yao et al., 2023) — Explore multiple reasoning paths as a search tree, not just one linear chain
- Retrieval-Augmented Generation (RAG) — Combine reasoning with external knowledge retrieval (the retrieval idea itself predates CoT)
- Faithful Reasoning — Making sure intermediate steps actually drive the final answer
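Of these follow-ups, self-consistency is the easiest to sketch: sample several reasoning chains for the same question, extract each chain's final answer, and return the majority vote. The `sample_chain` callable below is a stand-in for sampling from a real model:

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_chain, question, n_samples=5):
    """Self-consistency: sample several reasoning chains for the same
    question and majority-vote over their final answers."""
    answers = [sample_chain(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for a sampled model: four chains say "5", one slips to "6".
canned = cycle(["5", "6", "5", "5", "5"])
fake_model = lambda question: next(canned)
print(self_consistency(fake_model, "3 cars + 2 more cars?"))  # → 5
```

The vote smooths over occasional faulty chains, which is why it improves on a single greedy chain.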
Code: How It Works
Conceptually, CoT changes the prompt structure:
```python
def standard_prompt(question):
    """Without Chain-of-Thought."""
    return f"Q: {question}\nA:"

def cot_prompt(question, example_q, example_a_with_steps):
    """With Chain-of-Thought (few-shot): one worked example,
    then the new question."""
    return (
        f"Q: {example_q}\n"
        f"A: {example_a_with_steps}\n"
        f"Q: {question}\n"
        f"A: Let me work through this step by step.\n"
    )

# Example usage:
example_q = "If Alice has 5 apples and Bob gives her 3, how many does she have?"
example_a = """Step 1: Alice starts with 5 apples.
Step 2: Bob gives her 3 apples.
Step 3: 5 + 3 = 8.
Answer: 8 apples."""

question = "If there are 3 birds on a tree and 2 fly away, how many remain?"

# Standard approach: the model may answer directly and skip the reasoning.
# CoT approach: the model shows its work, which increases accuracy.
print(cot_prompt(question, example_q, example_a))
```
The model is the same. The prompt structure is different.
Why This Became Important
When ChatGPT launched in November 2022, users quickly discovered that asking it to "think step by step" improved answers on reasoning tasks. This was Chain-of-Thought in action.
By March 2023 (when GPT-4 launched), OpenAI and others were using CoT-style reasoning as a standard feature. Every modern LLM assistant uses some form of prompted reasoning.
The paper's simplicity and effectiveness made it one of the most influential prompting techniques in modern NLP.
Series Navigation
- Attention Is All You Need: The Paper That Changed AI
- BERT: How AI Learned to Truly Read
- GPT-2: How AI Learned to Write
- FLAN: How AI Learned to Follow Instructions
- InstructGPT: How AI Learned What Humans Actually Want
- Chain-of-Thought: How AI Learned to Show Its Work ← You are here
Last Updated: March 30, 2026
Author: RESEARCHER
Category: Research / Tutorial
Difficulty: Beginner-friendly
Prerequisite: Basic understanding of prompting and language models
Paper: Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (arXiv:2201.11903, January 2022)