InstructGPT: How AI Learned What Humans Actually Want
The paper behind ChatGPT. InstructGPT showed how to use human feedback to align model outputs with human preferences—turning a capable language model into an actually helpful assistant. This is reinforcement learning from human feedback (RLHF) made real.
Better Instructions, Same Problem
FLAN showed us that instruction tuning works. Fine-tune a model on tasks described in natural language, and it follows instructions better. Google's FLAN and OpenAI's own supervised fine-tuning experiments proved the concept.
But there was a catch.
Try asking a fine-tuned GPT-3 a question, and you'd get an answer. Try asking it a tricky question, and it might:
- Make up confident-sounding nonsense (hallucinate)
- Give an answer that's technically correct but misleading
- Follow instructions too literally when common sense should override them
- Produce verbose, boring outputs when conciseness would be better
The model had learned to follow instructions. But it hadn't learned what humans actually wanted.
Here's the core tension: instruction tuning optimizes for doing what you're told. RLHF optimizes for doing what you want.
Those aren't the same thing.
The Human-in-the-Loop Insight
The breakthrough was deceptively simple: ask humans to rate model outputs, then train the model to predict which outputs humans prefer.
Instead of:
Input: "Write me a poem about cats"
Output: "Roses are red, violets are blue, cats are... [continues awkwardly]"
You get:
Input: "Write me a poem about cats"
Rating by humans: 9/10 - "Delightful, creative, captures cat personality"
And you train the model to generate outputs that humans rate highly.
But how do you optimize for human preference? You can't just compute "how much did humans like this?" as a loss function.
Enter: Reinforcement Learning from Human Feedback (RLHF).
The Three-Step Process
InstructGPT isn't just one technique. It's a pipeline:
Step 1: Supervised Fine-Tuning (SFT)
Start with FLAN-style instruction tuning: train the model to follow instructions on demonstration data. This is the baseline.
Instruction: Write a haiku about spring
Output: Blossoms break the frost
Green shoots push through tired earth
Hope returns again
Data: ~13K instruction-output pairs, written by human contractors.
Why this step? The reward model (coming next) needs plausible outputs to compare, and a purely pretrained model won't reliably attempt instruction-following. You need the model to at least try to follow instructions before humans can usefully rank its outputs.
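Under the hood, SFT is ordinary supervised learning: maximize the log-likelihood of the demonstration response given the instruction. Here is a toy sketch of that loss (the `sft_loss` helper and the example numbers are illustrative, not from the paper): next-token cross-entropy, masked so only response tokens contribute.

```python
def sft_loss(target_logprobs, response_mask):
    """Masked next-token cross-entropy for supervised fine-tuning.

    target_logprobs: model log-probability of each target token
    response_mask:   1 for response tokens, 0 for prompt tokens
                     (prompt tokens are conditioned on, not trained on)
    """
    masked = [-lp * m for lp, m in zip(target_logprobs, response_mask)]
    return sum(masked) / max(sum(response_mask), 1)

# Toy example: 2 prompt tokens (masked out), 2 response tokens
loss = sft_loss([-0.1, -0.2, -0.3, -0.4], [0, 0, 1, 1])
```

Masking the prompt is a common design choice: the model should learn to produce responses, not to reproduce instructions.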
Step 2: Reward Modeling
This is where humans come in. Take model outputs and have humans compare them.
Instruction: Explain quantum entanglement to a 10-year-old
Output A (GPT-3 SFT): "Quantum entanglement is a phenomenon where two
or more particles become correlated such that the quantum state cannot
be described independently..."
Output B (GPT-3 SFT): "Imagine two magic coins that are connected. When
you flip one and it lands on heads, the other coin instantly lands on
tails—even if it's far away. That's kind of like entanglement..."
Human rater: Prefers Output B (much better for 10-year-olds)
Collect ~33K comparisons (humans ranking pairs of model outputs).
From these pairwise comparisons, train a reward model: a neural network that takes text and outputs a score representing "how much would a human like this?"
# Conceptual reward model
# InstructGPT used a 6B GPT-3 model with the final unembedding layer
# removed, replaced by a linear projection to a single scalar output.
reward_model = GPT3RewardModel(base_model="gpt3-6B")
# Takes generated text, outputs a single score
score = reward_model(prompt + response) # float, higher = humans prefer it
The reward model learns to predict: "if a human read this, would they like it?"
Key insight: You're not training the language model yet. You're training a separate model that judges outputs.
Step 3: Reinforcement Learning (RLHF)
Now use the reward model to train the language model.
The language model generates text. The reward model scores it. The language model gets reinforcement: high scores → encourage those outputs, low scores → discourage them.
This is classical reinforcement learning:
- Agent: Language model (GPT-3)
- Environment: Text generation task
- Reward: Score from the reward model
- Objective: Maximize expected reward
# Simplified RL loop (conceptual pseudocode)
language_model = GPT3()
reward_model = trained_reward_model  # from Step 2

for epoch in range(num_epochs):
    # Generate text from the language model
    prompt = "Explain quantum entanglement to a 10-year-old"
    generated_text = language_model.generate(prompt)

    # Score it with the reward model
    reward = reward_model(prompt + generated_text)

    # Update the language model to maximize reward
    # (in practice via PPO - Proximal Policy Optimization)
    loss = -reward  # negative because optimizers minimize
    loss.backward()
    optimizer.step()
But there's a catch: if you purely optimize for reward, the model might find weird loopholes or degrade in capability.
Solution: Constrain the model to stay close to the SFT version using a technique called KL divergence penalty.
# RL objective with KL penalty (conceptual)
reward = reward_model(generated_text)
kl_penalty = kl_divergence(language_model, sft_model)  # drift from the SFT policy
loss = -reward + 0.02 * kl_penalty  # 0.02 = KL coefficient
This says: "maximize reward, but don't drift too far from the SFT baseline." It prevents reward hacking and preserves general capability.
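For intuition, here is a toy version of that trade-off on explicit next-token distributions (the helper names and example distributions are mine; the real penalty is computed per token over the whole generation, not on one distribution):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(rm_score, policy_probs, sft_probs, beta=0.02):
    """Reward-model score minus a penalty for drifting away from the SFT policy."""
    return rm_score - beta * kl_divergence(policy_probs, sft_probs)
```

If the policy's distribution matches the SFT model's, the penalty is zero; the further it drifts, the more reward it must earn to compensate.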
Why This Actually Works
Alignment Without Explicit Rules
You can't write down "how to be helpful" in rules. What does it mean to give a helpful answer?
- Accuracy
- Clarity
- Appropriate tone
- Acknowledging uncertainty
- Conciseness vs. detail (context-dependent)
- Honesty even when it's uncomfortable
- ...dozens of other factors
RLHF learns these implicitly. The reward model sees thousands of human preferences and learns their underlying pattern.
Scaling Human Feedback
You can't have humans evaluate every output. So you:
- Have humans compare a sample of outputs (~33K comparisons)
- Train a model to predict human preferences
- Use that model to reward the language model
This is scalable. The human comparisons cost money but are manageable; the rest is computation.
Emergent Behavior
Something remarkable happens: the model learns concepts it was never explicitly trained on.
After RLHF, models become better at:
- Admitting uncertainty ("I'm not sure, but...")
- Declining harmful requests
- Showing reasoning step-by-step
- Correcting themselves
These behaviors emerge from optimizing for human preference, even though they weren't explicitly rewarded.
The Technical Picture
Reward Model Architecture
InstructGPT used a 6B parameter GPT-3 model with:
- The final unembedding layer removed
- A linear projection layer added to output a single scalar reward score
- Input: prompt + model-generated response
- Output: single scalar reward score
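That scalar head is just a learned linear projection of the final hidden state. A minimal sketch (the `reward_head` helper and toy vectors are illustrative, standing in for the 6B model's activations):

```python
def reward_head(hidden_state, weights, bias=0.0):
    """Linear projection of a final hidden state to a single scalar reward."""
    return sum(h * w for h, w in zip(hidden_state, weights)) + bias

# Toy 4-dim hidden state in place of the real model's final activations
score = reward_head([1.0, -2.0, 0.5, 3.0], [0.1, 0.2, 0.3, 0.4])
```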
The model learns from contrastive pairs: "this output is better than that one."
Training objective:
# Given a pair (better_output, worse_output)
score_better = reward_model(better_output)
score_worse = reward_model(worse_output)
# Want: score_better > score_worse
loss = -log(sigmoid(score_better - score_worse))
This is binary classification in disguise: given two texts, predict which one humans prefer.
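The objective above can be written as a tiny runnable function (a sketch of the pairwise ranking loss; the helper name is mine):

```python
import math

def pairwise_ranking_loss(score_better, score_worse):
    """-log(sigmoid(better - worse)): small when the reward model
    ranks the human-preferred output higher, large when it ranks it lower."""
    margin = score_better - score_worse
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the scores are equal the loss is log 2 (a coin flip); it shrinks toward zero as the reward model separates the preferred output by a wider margin.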
RL Algorithm: PPO
InstructGPT uses Proximal Policy Optimization (PPO), a standard RL algorithm.
PPO is stable and sample-efficient compared to simpler policy gradient methods. It prevents the model from changing too rapidly per update, which helps preserve capability.
Simplified version:
for epoch in range(epochs):
    # Rollout: generate completions for prompts
    prompts = load_batch(prompt_dataset)
    completions = language_model.generate(prompts)

    # Evaluate: score completions with the reward model
    rewards = reward_model(completions)

    # Update: optimize the policy with PPO
    ppo_update(language_model, prompts, completions, rewards)
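The heart of PPO is its clipped surrogate objective, which caps how much credit a single update can claim for moving the policy. A toy per-sample version (the helper name and the standard epsilon of 0.2 are illustrative; InstructGPT's exact PPO details are in the paper):

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO's per-sample objective: take the more pessimistic of the raw and
    clipped probability-ratio terms, so large policy jumps earn no extra credit."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

This pessimistic `min` is what makes PPO's updates conservative: pushing the probability ratio beyond the clip range stops improving the objective, so the policy changes gradually.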
Hyperparameters
- Batch size: 256-512 prompts per update
- Learning rate: Small, ~1e-5 (don't want to overwrite pretraining)
- KL coefficient: 0.02 (balance between reward and staying close to SFT)
- RL epochs: Few iterations (1-5) over each batch
- Prompt diversity: Mix in-distribution and out-of-distribution prompts
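Gathered into one place, the settings above might look like this in code (an illustrative config object mirroring the ballpark values listed, not OpenAI's actual training code):

```python
from dataclasses import dataclass

@dataclass
class RLHFConfig:
    batch_size: int = 512        # prompts per update
    learning_rate: float = 1e-5  # small, to avoid overwriting pretraining
    kl_coefficient: float = 0.02 # balance reward vs. staying close to SFT
    ppo_epochs: int = 4          # few passes over each batch

config = RLHFConfig()
```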
Results: InstructGPT vs GPT-3
Human Evaluation
Human labelers compared outputs from different models side by side. The results were measured as win rates — how often labelers preferred one model's output over another:
| Comparison | InstructGPT preferred |
|---|---|
| InstructGPT (175B) vs. GPT-3 (175B) | ~85% of the time |
| InstructGPT (175B) vs. GPT-3 prompted (175B) | ~71% of the time |
| InstructGPT (175B) vs. SFT (175B) | ~61% of the time |
The most striking result: the 1.3B parameter InstructGPT model was preferred over the 175B GPT-3 — a model 100× larger. RLHF made a tiny model more useful than a giant one.
Specific Behaviors
Writing Quality:
- GPT-3 (SFT): Rambling, unclear, often grammatically awkward
- InstructGPT: Concise, well-structured, polished
Truthfulness:
- GPT-3: Confident lies ("Einstein discovered electricity")
- InstructGPT: Admits uncertainty ("I don't know, but...")
Harm Avoidance:
- GPT-3: Explains how to make explosives or synthesize drugs if asked
- InstructGPT: Declines harmful requests while being respectful
Math/Logic:
- GPT-3: Often wrong on arithmetic
- InstructGPT: Frequently correct (and shows work)
Scaling
The technique scales across model sizes. OpenAI tested 1.3B, 6B, and 175B parameter models, and all three benefited from RLHF, with consistent preference improvements at every scale. Combined with the fact that the 1.3B InstructGPT model beat the raw 175B GPT-3, this showed that alignment techniques can matter more than raw scale, a finding with major practical implications.
Why This Mattered: The Path to ChatGPT
InstructGPT was published by OpenAI in March 2022 — eight months before ChatGPT launched in November 2022.
InstructGPT was the technique that made ChatGPT possible. OpenAI described ChatGPT as a "sibling model" to InstructGPT, trained with the same RLHF pipeline on a more conversational dataset.
The pipeline is:
- Pretraining (GPT-3 style)
- Instruction tuning (FLAN style)
- RLHF (InstructGPT style)
This became the standard. Claude, Gemini, and other modern assistants all use RLHF.
Why RLHF Mattered
Before InstructGPT:
- Models were impressive but unpredictable
- They would confidently hallucinate
- They'd follow instructions to absurd extremes
- Companies couldn't reliably control behavior
After InstructGPT:
- Models were genuinely helpful
- They admitted uncertainty
- They declined harmful requests
- Companies had a path to alignment
This wasn't just an incremental improvement. It was the difference between "impressive toy" and "actually usable assistant."
The Alignment Story
InstructGPT introduced a framework for alignment: use human feedback to shape model behavior.
This is important because:
- Alignment is hard to specify: You can't write perfect rules for helpful behavior
- Humans can judge better than they can describe: We know good answers when we see them
- Scalable: Train a reward model once, use it to train many models
But it's not perfect:
- Reward models can be gamed (model learns to exploit the reward signal)
- Human raters have biases (biased feedback → biased reward model)
- Changing the reward model can dramatically change behavior
- You're only as good as your training data (biases accumulate)
Still, RLHF became the standard approach to alignment in LLMs. It's not perfect, but it works.
Practical Insights
Cost Breakdown (Estimated)
Approximate costs for RLHF training at GPT-3 scale (industry estimates, not from the paper):
- Pretraining: $5-10 million (one-time)
- Instruction data: $50K-100K (13K examples × human annotation)
- Reward model training: $100K+ (GPU time)
- RLHF: $500K-1M+ (lots of RL sampling and compute)
This is expensive, but far cheaper than pretraining, and practical for companies operating at scale.
Why You See This Everywhere
RLHF works. Modern models use it. Why?
- Simple concept: Humans prefer X, reward X
- Empirically effective: consistent, measurable gains in human preference
- Scalable: Doesn't require model retraining from scratch
- Adaptable: Can swap reward models or feedback sources
The Bigger Picture
From Language Modeling to Assistance
The three-step pipeline (pretraining → instruction tuning → RLHF) is how we went from:
GPT-2/3 era (2018-2020):
You: "Write a story about dragons"
GPT-3: "Write a story about dragons flying in the sky, breathing fire on
castles, fighting knights, and... [continues for pages unprompted]"
You: [frustrated] "I didn't ask for all that"
ChatGPT era (2022+):
You: "Write a short story about dragons"
ChatGPT: "# The Lonely Dragon
In a valley of grey stone lived Ember, a dragon with scales of silver..."
[200 words, exactly what was asked for, engaging, well-written]
RLHF is a large part of why the second version exists.
Open Questions
RLHF doesn't solve everything:
- Hallucination: Models still make things up; RLHF helps but doesn't eliminate it
- Adversarial robustness: Careful prompting can still break alignment
- Long-horizon reasoning: Hard to reward long chains of reasoning with short feedback
- Value misalignment: What if human raters are wrong? What if we optimize for the wrong thing?
These remain active research areas.
Series Navigation
- Attention Is All You Need: The Paper That Changed AI
- BERT: How AI Learned to Truly Read
- GPT-2: How AI Learned to Write
- FLAN: How AI Learned to Follow Instructions
- InstructGPT: How AI Learned What Humans Actually Want ← You are here
Last Updated: March 30, 2026
Author: RESEARCHER
Category: Research / Tutorial
Difficulty: Intermediate
Prerequisite: Understanding of FLAN and basic reinforcement learning concepts