What Problem Does GEPA Solve?
Imagine you're writing a prompt for an AI to answer math questions. Your first attempt might be: "Solve this math problem: {question}"
This works okay, but not great. So you try: "You are a math expert. Think step by step. Solve: {question}"
Better! But still not perfect. You keep tweaking, testing, tweaking, testing... This manual process is exhausting. GEPA automates it.
The Core Idea: Treat Prompts Like Living Things
GEPA borrows ideas from biological evolution. In nature:
- Creatures are born with DNA (their "instructions")
- They live and get tested by the environment
- The ones that survive pass on their DNA
- Their children have slightly modified DNA (mutations)
- Over generations, creatures get better at surviving
GEPA does the same thing, but with prompts instead of creatures:
- A prompt is born (your initial attempt)
- It gets tested on real tasks
- Good prompts "survive" and get to create children
- Children are modified versions (mutations)
- Over iterations, prompts get better at the task
Key Terms Explained
Candidate
A "candidate" is simply one version of your prompt (or set of prompts). Think of it like a job applicant. You have many candidates applying for the job of "best prompt." Each candidate is trying to prove they're the best.
# A candidate is just a dictionary mapping names to text
candidate = {
"system_prompt": "You are a helpful math tutor...",
"format_instructions": "Show your work step by step..."
}
Mutation
In biology, mutation means a small random change to DNA. In GEPA, mutation means changing the prompt text.
But here's the clever part: GEPA doesn't make random changes. It uses an LLM to intelligently suggest improvements based on what went wrong.
Example of mutation:
BEFORE (Parent):
"Solve this math problem."
The LLM notices: "This prompt fails on word problems
because it doesn't tell the model to identify what's being asked."
AFTER (Child/Mutant):
"Read the problem carefully. Identify what is being asked.
Then solve step by step."
The child prompt is a "mutation" of the parent — similar, but improved.
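In code, an LLM-guided mutation step might look roughly like the sketch below. This is illustrative rather than GEPA's actual API; call_llm is a stand-in for whatever LLM client you use.
def mutate_prompt(parent_prompt, failure_feedback, call_llm):
    # Ask an LLM to rewrite the parent prompt so it addresses the observed failures.
    instruction = (
        "Here is a prompt and feedback about where it failed.\n\n"
        f"PROMPT:\n{parent_prompt}\n\n"
        f"FEEDBACK:\n{failure_feedback}\n\n"
        "Write an improved version of the prompt that fixes these issues. "
        "Return only the new prompt text."
    )
    return call_llm(instruction)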
Selection
You can't keep every prompt forever — you'd have thousands. Selection means choosing which prompts to keep and which to discard.
Think of it like a talent show:
- Round 1: 100 contestants
- Round 2: Keep the best 20
- Round 3: Keep the best 5
- Winner: The single best
GEPA uses selection to focus on promising prompts rather than wasting time on bad ones.
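As a minimal sketch (assuming each candidate already has an average score), selection can be as simple as keeping the top k:
def select_top_k(candidates, scores, k=5):
    # Keep only the k highest-scoring candidates; discard the rest.
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:k]]
GEPA's own selection is smarter than a plain top-k cut (see the Pareto frontier below), but the idea is the same: spend your budget on the promising candidates.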
Pareto Frontier
This is a crucial concept. Let me explain with an example.
Imagine you're rating restaurants on two criteria: taste and price.
| Restaurant | Taste | Price (lower=better) |
|---|---|---|
| A | 9 | 8 (expensive) |
| B | 7 | 3 (cheap) |
| C | 6 | 7 |
| D | 8 | 4 |
Which is "best"? It depends on what you care about!
- If you want the tastiest: A
- If you want the cheapest: B
- If you want a balance: D
Restaurant C is clearly worse than D (D beats it on BOTH taste AND price). We say D "dominates" C.
The Pareto frontier is the set of options where no other option beats them on ALL criteria. Here, the frontier is {A, B, D}. These are all "good" in different ways.
In GEPA: Different prompts might excel on different types of problems. One prompt might be great at algebra but bad at word problems. Another might be the opposite. The Pareto frontier keeps BOTH because each is "best" at something.
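To make this concrete, here is a small Python sketch that computes the Pareto frontier for the restaurant example above (higher taste is better, lower price is better):
# Find the non-dominated restaurants from the table above.
restaurants = {"A": (9, 8), "B": (7, 3), "C": (6, 7), "D": (8, 4)}  # name -> (taste, price)

def dominates(x, y):
    # x dominates y if x is at least as good on both criteria and strictly better on one
    taste_x, price_x = x
    taste_y, price_y = y
    better_or_equal = taste_x >= taste_y and price_x <= price_y
    strictly_better = taste_x > taste_y or price_x < price_y
    return better_or_equal and strictly_better

frontier = [
    name for name, scores in restaurants.items()
    if not any(dominates(other, scores) for other in restaurants.values() if other != scores)
]
print(frontier)  # ['A', 'B', 'D']: C is dominated by D, so it drops off the frontier
In GEPA, the criteria are scores on individual training examples rather than taste and price, so the frontier keeps any prompt that is still the best somewhere.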
Reflection
This is GEPA's secret weapon. Instead of blind trial-and-error, GEPA thinks about why things failed.
Imagine a student who gets a math test back:
- Bad approach: "I got 70%. I'll just study more."
- Reflective approach: "I got 70%. Let me look at what I got wrong... Ah, I keep making sign errors in negative numbers. I need to be more careful with negatives."
GEPA does the reflective approach. It looks at failed examples and asks an LLM: "What went wrong? How can we fix the prompt?"
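Concretely, reflection boils down to packaging the prompt, the failed examples, and a "what went wrong?" question into one request for a reflection LLM. The sketch below is illustrative; the layout and wording are assumptions, not GEPA's exact reflection prompt.
def build_reflection_prompt(prompt_text, failed_examples):
    # failed_examples: list of (question, model_answer, expected_answer) tuples
    lines = [
        "The following prompt was used to answer questions but failed on some of them.",
        f"PROMPT:\n{prompt_text}",
        "FAILED EXAMPLES:",
    ]
    for question, got, expected in failed_examples:
        lines.append(f"- Question: {question}\n  Model answered: {got}\n  Expected: {expected}")
    lines.append("Explain what went wrong and how the prompt should change to fix it.")
    return "\n\n".join(lines)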
Trajectory / Trace
A trajectory is the full record of what happened when running the prompt.
Think of it like a recipe attempt:
- Input: "Make chocolate cake"
- Step 1: Got flour ✓
- Step 2: Added sugar ✓
- Step 3: Set oven to 500°F ✗ (too hot!)
- Step 4: Cake burned
- Output: Burned cake
- Score: 2/10
The trajectory includes ALL the steps, not just the final score. This helps GEPA understand WHERE things went wrong (Step 3), not just THAT things went wrong.
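A trajectory can be as simple as a record holding the input, every intermediate step, the final output, and the score. The fields below are hypothetical; the exact shape depends on your system.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    task_input: str                              # e.g. "Make chocolate cake"
    steps: list = field(default_factory=list)    # every intermediate action or observation
    output: str = ""                             # final result, e.g. "Burned cake"
    score: float = 0.0                           # e.g. 2/10 -> 0.2

traj = Trajectory(task_input="Make chocolate cake")
traj.steps += ["Got flour", "Added sugar", "Set oven to 500°F"]  # the third step is where things went wrong
traj.output, traj.score = "Burned cake", 0.2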
Epoch
An epoch means one complete pass through all your training data.
If you have 100 training examples and process them in batches of 10:
- Batch 1: examples 1-10
- Batch 2: examples 11-20
- ...
- Batch 10: examples 91-100
- End of Epoch 1
- Shuffle the data
- Batch 1: examples 47, 3, 82... (random order now)
- ...and so on
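In code, epoch-style iteration is a full pass over the data in batches, with a reshuffle between epochs. A generic sketch:
import random

def iterate_epochs(train_examples, batch_size=10, num_epochs=2):
    for epoch in range(num_epochs):
        random.shuffle(train_examples)  # new random order at the start of each epoch
        for start in range(0, len(train_examples), batch_size):
            yield epoch, train_examples[start:start + batch_size]

for epoch, batch in iterate_epochs(list(range(100)), batch_size=10):
    pass  # evaluate/improve the current candidate on this batch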
Minibatch
Instead of testing a prompt on ALL your data at once (slow and expensive), you test on a small minibatch — maybe just 3-5 examples.
This is like a chef tasting a dish while cooking rather than waiting until the entire banquet is prepared.
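In code, drawing a minibatch is just a small random sample of the training set:
import random

def sample_minibatch(train_examples, size=4):
    # Pick a handful of examples for a quick, cheap evaluation.
    return random.sample(train_examples, k=min(size, len(train_examples)))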
Why Each Component Exists
Let me map the code components to their purpose:
| Component | Real-World Analogy | Purpose |
|---|---|---|
| GEPAAdapter | Translator | Connects GEPA to YOUR specific system |
| CandidateSelector | Talent scout | Picks which prompts to improve next |
| ReflectiveMutationProposer | Writing coach | Suggests improvements to prompts |
| BatchSampler | Test administrator | Picks which examples to test on |
| ComponentSelector | Focus advisor | Decides which part of the prompt to edit |
| StopperProtocol | Race official | Decides when optimization is done |
| EvaluationPolicy | Grading system | Decides how to score prompts |
The Full Process, Step by Step
Let me walk through what actually happens:
Step 0: You Provide a Starting Point
Your seed prompt: "Answer the question: {question}"
Your training data: 100 math problems with answers
Your metric: "Did the model get the right answer?" (0 or 1)
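A 0-or-1 metric for this setup could be as simple as an exact-match check. The field names below ("question", "answer") are illustrative:
def exact_match_metric(example, model_output):
    # example is assumed to look like {"question": "...", "answer": "..."}
    return 1.0 if model_output.strip() == str(example["answer"]).strip() else 0.0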
Step 1: Test the Starting Prompt
GEPA runs your prompt on a few examples (a minibatch):
Example 1: "What is 2+2?" → Model says "4" → Score: 1 ✓
Example 2: "If John has 3 apples..." → Model says "7" → Score: 0 ✗
Example 3: "Solve: 3x = 9" → Model says "x = 3" → Score: 1 ✓
Average score: 0.67 (2 out of 3 correct)
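Scoring a prompt on a minibatch means running the model on each example and averaging the metric. In the sketch below, run_model and metric are placeholders for your own LLM call and scoring function:
def evaluate_on_minibatch(prompt, minibatch, run_model, metric):
    scores = [metric(example, run_model(prompt, example["question"])) for example in minibatch]
    return sum(scores) / len(scores)  # e.g. (1 + 0 + 1) / 3 ≈ 0.67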
Step 2: Reflect on Failures
GEPA asks a smart LLM: "Here's the prompt and what happened. Why did Example 2 fail?"
The reflection LLM analyzes:
"The prompt failed on the word problem because it doesn't
instruct the model to:
1. Identify the key information
2. Set up an equation
3. Solve systematically
The model just guessed instead of reasoning through it."
Step 3: Propose a Mutation
Based on the reflection, GEPA generates a new prompt:
OLD: "Answer the question: {`{question}`}"
NEW: "Read the problem carefully. Identify the key numbers
and what is being asked. Set up your approach, then solve
step by step. Question: {`{question}`}"
Step 4: Test the New Prompt
The mutated prompt is tested:
Example 1: Score 1 ✓
Example 2: Score 1 ✓ (now works!)
Example 3: Score 1 ✓
Average score: 1.0 (3 out of 3 correct)
Step 5: Selection — Keep the Good Ones
Now GEPA has two candidates:
- Original prompt: scores 0.67
- Mutated prompt: scores 1.0
The mutated prompt is clearly better, so it becomes the new "best."
But GEPA might keep BOTH if they each excel on different types of problems (Pareto frontier).
Step 6: Repeat
Go back to Step 1 with the new best prompt(s). Keep improving until:
- You've used up your budget (max evaluations)
- The prompt is "good enough"
- No more improvement is happening
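These stopping conditions can be expressed as one small check; the thresholds below are illustrative:
def should_stop(evals_used, budget, best_score, target_score, rounds_without_improvement, patience=5):
    return (
        evals_used >= budget                         # budget exhausted
        or best_score >= target_score                # good enough
        or rounds_without_improvement >= patience    # no more improvement is happening
    )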
Step 7: Return the Best
After many iterations, GEPA returns the best prompt it found.
Why Is Selection Needed?
Imagine you're breeding dogs for speed. You have 100 dogs.
Without selection: You let ALL dogs breed randomly. Slow dogs have puppies too. After 10 generations, average speed hasn't improved much.
With selection: You only let the TOP 10 fastest dogs breed. After 10 generations, average speed has improved dramatically.
Selection focuses your resources on the most promising candidates. Without it, you waste time trying to improve bad prompts instead of building on good ones.
Why Is Mutation Needed?
If you only kept the best prompt and never changed it, you'd be stuck. The original prompt has fundamental limitations.
Mutation creates variation — new ideas to try. Most mutations might be worse, but occasionally one is better. Selection keeps the better ones.
It's like brainstorming:
- Generate many ideas (mutation)
- Pick the best ones (selection)
- Build on those (repeat)
Why Pareto Frontier Instead of Just "The Best"?
Simple "pick the best" has a problem: it gets stuck.
Imagine prompt A scores 90% on algebra but 50% on word problems. Prompt B scores 60% on both. By average, A (70%) beats B (60%).
But what if you evolved B and it became C, which scores 85% on both? That beats A!
If you only kept A, you'd never discover C. The Pareto frontier keeps diverse prompts alive, allowing you to discover solutions you'd miss otherwise.
What Makes GEPA Special?
Most prompt optimizers try random changes or simple patterns. GEPA is different because:
- It thinks about WHY things failed (reflection), not just THAT they failed
- It keeps diverse solutions (Pareto frontier), not just the single best
- It uses language to improve language — an LLM suggests prompt improvements
In the GEPA paper's reported experiments, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. It also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs.


