The Problem: Manual Prompt Engineering Doesn't Scale
You've written a prompt. It works... sometimes. So you tweak it. Test it. Tweak it again. Add "think step by step." Remove it. Add examples. Remove them. Try different phrasings.
After hours of this, you've maybe improved accuracy by 5%. And you have no idea if that improvement will hold on new data.
This is the prompt engineering treadmill. DSPy's optimizers are designed to get you off it.
What DSPy Optimizers Actually Do
Every DSPy optimizer answers the same question: "Given my task and some training data, what should my prompt look like?"
But they attack it from three different angles:
| Approach | What Gets Optimized | Optimizers |
|---|---|---|
| Few-shot demos | "What examples should I show?" | LabeledFewShot, BootstrapFewShot, BootstrapFewShotWithRandomSearch, KNNFewShot |
| Instructions | "How should I describe the task?" | COPRO, MIPROv2, GEPA |
| Model weights | "Can I bake this into the model itself?" | BootstrapFinetune |
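Concretely, every optimizer exposes the same compile step. Here is a minimal sketch of that shared pattern, assuming you already have a DSPy program class (`MyProgram`), a `trainset` of `dspy.Example` objects, and a metric function (`my_metric`); those names are placeholders, not part of DSPy itself:

import dspy
from dspy.teleprompt import BootstrapFewShot

# Every optimizer follows the same pattern: construct it with a metric,
# compile it against a student program and training data, then reuse the result.
optimizer = BootstrapFewShot(metric=my_metric)
compiled_program = optimizer.compile(student=MyProgram(), trainset=trainset)

# The compiled program is a regular DSPy module you can call and persist
# (assuming a "question -> answer" style signature here).
prediction = compiled_program(question="What is 2 + 2?")
compiled_program.save("compiled_program.json")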
Quick Reference: When to Use What
Before diving deep, here's the practical cheat sheet:
Few-Shot Demo Optimizers
- LabeledFewShot - Simply grabs k random examples from training data. No validation, no magic.
- BootstrapFewShot - The search algorithm goes through every training input in the trainset, gets a prediction and then checks whether it "passes" the metric. If it passes, the examples are added to the compiled program. It captures full traces including intermediate reasoning.
- BootstrapFewShotWithRandomSearch - This creates many candidate programs by shuffling the training set and picking the best one. It tries different numbers of bootstrapped samples and evaluates each on the validation set.
- KNNFewShot - Uses an in-memory KNN retriever to find the k nearest neighbors at test time. For each input, it identifies the most similar examples and attaches them as demonstrations.
Instruction Optimizers
- COPRO - Uses coordinate ascent (hill-climbing) to optimize instructions. It generates multiple alternative instructions, evaluates them using your metric, and iteratively selects the best-performing ones.
- MIPROv2 - Works in three stages: 1) Bootstrap few-shot examples. 2) Propose instruction candidates. 3) Find an optimized combination of both using Bayesian Optimization.
- GEPA - Reflective evolution using natural language feedback. Analyzes WHY things failed, not just THAT they failed. Use when you need sample-efficient optimization or have rich feedback signals.
Weight Optimizer
- BootstrapFinetune - A weight-optimizing variant that invokes a teacher program to produce reasoning, then uses those traces to fine-tune a student program.
1. LabeledFewShot
Optimizes: Few-shot Demonstrations | Complexity: Simple | Best For: Quick baseline, any data size
The Idea
LabeledFewShot is the simplest optimizer—it just grabs examples directly from your training data and prepends them to prompts.
Step-by-Step
- Initialize: Specify `k` (the number of examples to include in each prompt)
- Random Selection: Randomly select `k` examples from your `trainset`
- Format Examples: Convert the selected examples into the format expected by your module's signature
- Prepend to Prompt: These examples are prepended to every prompt sent to the LM
Key Characteristics:
- No validation—examples are used as-is
- Examples are "raw" (no intermediate reasoning steps)
- Results may vary between runs due to random selection
- No bootstrapping or augmentation
from dspy.teleprompt import LabeledFewShot

optimizer = LabeledFewShot(k=8)
compiled = optimizer.compile(student=your_program, trainset=trainset)
When It Works
Your examples are already high-quality. You need a baseline fast. You're testing whether few-shot helps at all before investing in fancier optimization.
When It Fails
Your training data has noise. Some examples are misleading. Different inputs need different types of examples.
2. BootstrapFewShot
Optimizes: Few-shot Demonstrations | Complexity: Moderate | Best For: ~10 examples, when you need validated demonstrations
The Idea
BootstrapFewShot uses a "teacher" program to generate high-quality demonstrations that pass a metric, then uses these to train a "student" program.
Step-by-Step
- Setup Teacher: If no teacher is provided, the student program itself becomes the teacher (optionally compiled with LabeledFewShot first)
- Enable Tracing: Turn on compile mode to track all internal Predict calls
- Iterate Through Training Examples:

  for each example in trainset:
      if we_found_enough_bootstrapped_demos():
          break
      # Run teacher on example
      prediction = teacher(**example.inputs())
      # Capture traces of all internal module calls
      predicted_traces = dspy.settings.trace

- Validate with Metric: For each prediction, check whether it passes the metric:

  if metric(example, prediction, predicted_traces):
      # This is a valid demonstration!
      for predictor, inputs, outputs in predicted_traces:
          demo = dspy.Example(automated=True, **inputs, **outputs)
          compiled_program[predictor_name].demonstrations.append(demo)

- Collect Valid Demos: Only demonstrations that pass the metric are kept. This continues until `max_bootstrapped_demos` valid traces are collected.
- Combine with Labeled: The final prompt includes:
  - Up to `max_labeled_demos` raw examples from the trainset
  - Up to `max_bootstrapped_demos` validated, augmented traces
Key Parameters:
- `max_bootstrapped_demos=4`: Augmented traces to generate
- `max_labeled_demos=16`: Raw demos from the training set
- `max_rounds=1`: Attempts per example
- `metric`: Function to validate outputs
Why "Bootstrapped" Demos Are Special: Unlike labeled demos, bootstrapped demos include intermediate reasoning (chain-of-thought, retrieved context, etc.) because they capture the full trace of what the teacher did.
3. BootstrapFewShotWithRandomSearch
Optimizes: Few-shot Demonstrations | Complexity: Higher | Best For: 50+ examples, when you want to find optimal demo combinations
The Idea
This optimizer runs BootstrapFewShot multiple times with different random configurations and selects the best-performing program.
Step-by-Step
- Create Candidate Pool: Generate multiple candidate programs:
  - The uncompiled (baseline) program
  - A LabeledFewShot-optimized program
  - BootstrapFewShot with unshuffled examples
  - `num_candidate_programs` × BootstrapFewShot with randomized/shuffled examples
- For Each Candidate:

  for candidate_config in candidate_pool:
      # Shuffle the training set differently each time
      shuffled_trainset = shuffle(trainset)
      # Vary the number of bootstrapped demos
      num_demos = random_choice(range(1, max_bootstrapped_demos))
      # Compile with BootstrapFewShot
      candidate = BootstrapFewShot.compile(student, shuffled_trainset)

- Evaluate All Candidates: Run each candidate program on the validation set and compute metric scores
- Select Best: Return the program that achieved the highest score on the validation set
Key Parameters:
- `num_candidate_programs=10`: Number of random configurations to try
- All BootstrapFewShot parameters also apply
Why Random Search Helps: Different subsets of examples work better for different tasks. Random search explores this space efficiently without expensive exhaustive search.
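Usage mirrors BootstrapFewShot, with an extra knob for how many candidates to try. A sketch, with the metric and program names again standing in for your own:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimizer = BootstrapFewShotWithRandomSearch(
    metric=exact_match_metric,
    max_bootstrapped_demos=4,
    num_candidate_programs=10,   # how many shuffled/varied candidates to evaluate
)
compiled = optimizer.compile(student=your_program, trainset=trainset)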
4. KNNFewShot
Optimizes: Few-shot Demonstrations (dynamically) | Complexity: Moderate | Best For: When different inputs benefit from different examples
The Idea
Instead of using fixed examples, KNNFewShot dynamically selects the most relevant examples for each input using semantic similarity.
Step-by-Step
- Initialize Embeddings:

  # Create embedder (e.g., using SentenceTransformers)
  embedder = Embedder(SentenceTransformer("all-MiniLM-L6-v2").encode)
  # Embed all training examples
  for example in trainset:
      example_vector = embedder.encode(example.inputs())
      store_in_index(example_vector, example)

- At Inference Time (for each new input):
  a. Convert the input to a vector: `query_vector = embedder.encode(new_input)`
  b. Find the k nearest neighbors in the training set index
  c. Retrieve the k most similar training examples
  d. Use these as demonstrations for THIS specific query
- Dynamic Prompts: Each input gets a customized prompt containing the most relevant examples
Key Parameters:
- `k=3`: Number of nearest neighbors to retrieve
- `trainset`: Examples to search through
- `vectorizer`: Embedding function
When to Use:
- Your task has distinct "types" of inputs
- Different categories benefit from different examples
- One-size-fits-all demos don't work well
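A usage sketch, following the pattern in the DSPy docs; the SentenceTransformer model and the `question -> answer` signature are just illustrative choices:

import dspy
from dspy.teleprompt import KNNFewShot
from sentence_transformers import SentenceTransformer

# Any callable that maps a batch of texts to vectors works as the vectorizer
embedder = dspy.Embedder(SentenceTransformer("all-MiniLM-L6-v2").encode)

optimizer = KNNFewShot(k=3, trainset=trainset, vectorizer=embedder)
compiled = optimizer.compile(student=dspy.ChainOfThought("question -> answer"))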
5. COPRO (Cooperative Prompt Optimization)
Optimizes: Instructions | Complexity: Moderate | Best For: Zero-shot optimization, when you need better task descriptions
How It Works
COPRO optimizes the instruction text (not examples) using coordinate ascent (hill-climbing).
Step-by-Step
- Initialize: Start with the current instruction for each predictor in your program
- For Each Depth Iteration (outer loop):

  for d in range(depth):
      temperature = init_temperature * (0.9 ** d)  # Cool down over time

- Generate Candidates (inner loop for each predictor):

  for each predictor in program:
      # Use the LM to propose new instructions
      for i in range(breadth):
          new_instruction = prompt_model.generate(
              context=[
                  "Previous instructions and scores",
                  "Propose a better instruction",
              ],
              temperature=temperature,
          )
          candidates.append(new_instruction)

- Evaluate Each Candidate:

  for candidate_instruction in candidates:
      # Create a program copy with this instruction
      test_program = program.copy()
      test_program.predictor.instruction = candidate_instruction
      # Evaluate on the training set
      score = evaluate(test_program, trainset, metric)

- Select Best: Keep the instruction with the highest score
- Coordinate Ascent: Repeat the process, using the history of tried instructions to inform new proposals
Key Parameters:
- `breadth=10`: Number of candidate instructions per iteration
- `depth=3`: Number of refinement iterations
- `init_temperature=1.4`: Controls creativity (higher = more diverse)
- `prompt_model`: LM used to generate instruction candidates
The Meta-Prompt Used:
"You are an instruction optimizer for large language models.
I will give some task instructions I've tried, along with their
corresponding validation scores. The instructions are arranged
in increasing order based on their scores, where higher scores
indicate better quality.
Your task is to propose a new instruction that will lead a good
language model to perform the task even better. Don't be afraid
to be creative."
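A usage sketch; note that COPRO's `compile` takes an `eval_kwargs` dict for its internal evaluations, and the model and metric names here are placeholders:

import dspy
from dspy.teleprompt import COPRO

optimizer = COPRO(
    prompt_model=dspy.LM("openai/gpt-4o-mini"),  # LM that writes candidate instructions
    metric=exact_match_metric,
    breadth=10,
    depth=3,
    init_temperature=1.4,
)
compiled = optimizer.compile(
    your_program,
    trainset=trainset,
    eval_kwargs={"num_threads": 8, "display_progress": True},
)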
6. MIPROv2 (Multiprompt Instruction Proposal Optimizer v2)
Optimizes: Instructions AND Demonstrations (jointly) | Complexity: High | Best For: 200+ examples, when you want the best possible optimization
How It Works
MIPROv2 is the comprehensive optimizer. It jointly optimizes both instructions AND demonstrations using three stages and Bayesian optimization.
Stage 1: Bootstrap Few-Shot Examples
for i in range(num_candidates):
candidate_demos = []
for example in random_sample(trainset):
output = program(**example.inputs())
if metric(example, output) > metric_threshold:
candidate_demos.append(trace)
if len(candidate_demos) >= max_bootstrapped_demos:
break
demo_sets[i] = candidate_demos
Stage 2: Propose Instruction Candidates (Grounded Proposal)
This stage is "data-aware and demonstration-aware":
# Gather context for instruction generation
context = {
    "dataset_summary": summarize_training_data(trainset),
    "program_code": inspect_dspy_program(program),
    "example_traces": bootstrapped_demos,
    "predictor_being_optimized": predictor_name,
}

# Generate multiple instruction candidates
for i in range(num_candidates):
    instruction = prompt_model.generate(
        f"Given this context: {context}, write an effective instruction"
    )
    instruction_candidates.append(instruction)
Stage 3: Bayesian Optimization Search
for trial in range(num_trials):
combo = bayesian_optimizer.suggest({
"instruction_idx": range(num_candidates),
"demo_set_idx": range(num_demo_sets)
})
test_program = build_program(
instruction = instructions[combo.instruction_idx],
demos = demo_sets[combo.demo_set_idx]
)
# Evaluate on minibatch
score = evaluate(test_program, minibatch, metric)
bayesian_optimizer.tell(combo, score)
Key Parameters:
- `num_candidates=7`: Instruction/demo candidates to generate
- `num_trials=15-50`: Bayesian optimization iterations
- `minibatch_size=25`: Examples per trial evaluation
- `minibatch_full_eval_steps=10`: How often to run a full evaluation
- `auto="light"|"medium"|"heavy"`: Preset configurations
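In practice you rarely set these by hand; the `auto` presets pick sensible values. A usage sketch, with placeholder metric and program names:

from dspy.teleprompt import MIPROv2

optimizer = MIPROv2(metric=exact_match_metric, auto="medium")
compiled = optimizer.compile(
    your_program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
)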
Why Bayesian Optimization?
| Method | Strategy | Efficiency |
|---|---|---|
| Grid Search | Try every combination | Terrible for large spaces |
| Random Search | Try random combinations | Better, but blind |
| Bayesian | Learn from each trial, focus on promising areas | Smart |
7. GEPA (Genetic-Pareto Reflective Optimizer)
Optimizes: Instructions (with rich feedback) | Complexity: Sophisticated | Data needed: ~10+ examples (sample-efficient!)
The Idea
GEPA is fundamentally different from other optimizers. Instead of blind trial-and-error, it reflects on failures. It asks: "WHY did this fail? What specifically went wrong? How can we fix it?"
This is like the difference between a student who just retakes tests hoping for better luck, versus one who reviews their mistakes and learns from them.
The Core Innovation: Reflection + Pareto Frontier
GEPA maintains a Pareto frontier of candidates — prompts that each excel at something. Rather than converging on one "best" prompt, it keeps diverse specialists alive.
Each iteration:
- Sample a candidate from the Pareto frontier
- Run it on a minibatch, collect traces AND feedback
- Reflect: analyze what went wrong
- Mutate: propose an improved instruction
- Evaluate: if better, add to frontier
Step-by-Step
1. Sample from Pareto Frontier
# Candidates weighted by "coverage" - how many examples they're best on
candidate = weighted_sample(pareto_frontier)
2. Collect Traces + Feedback
for example in minibatch:
output = candidate(**example.inputs())
trace = capture_reasoning()
feedback = metric.get_feedback(example, output) # Rich text feedback!
3. Reflect on Failures
Here's the magic. GEPA sends this to a reflection LLM:
"Current instruction: [instruction]
Here's what happened:
- Example 1: FAILED. Feedback: 'Model didn't identify the variable
to solve for, just computed 7+3 instead of recognizing x+3=7'
- Example 2: FAILED. Feedback: 'Multiplied instead of dividing'
- Example 3: SUCCESS.
Analyze what went wrong. Propose an improved instruction."
4. Mutate Based on Reflection
new_instruction = reflection_lm.generate(
reflection_prompt + traces + feedback
)
5. Update Pareto Frontier
if new_candidate_is_non_dominated():
pareto_frontier.add(new_candidate)
remove_dominated_candidates()
Why GEPA Outperforms
| Aspect | Traditional Optimizers | GEPA |
|---|---|---|
| Feedback | Scalar score (0.73) | Rich text ("failed because...") |
| Search | Random or gradient-based | Reflection-guided |
| Diversity | Converge to single best | Pareto frontier of specialists |
| Efficiency | Needs many rollouts | 35× fewer rollouts than GRPO |
Results from the paper:
- Outperforms GRPO (reinforcement learning) by 10% on average, up to 20%
- Uses up to 35× fewer rollouts
- Outperforms MIPROv2 by 10%+ across benchmarks
When to Use GEPA
- You have rich feedback signals (not just pass/fail)
- Sample efficiency matters (limited budget)
- Your task has diverse subtypes (Pareto frontier helps)
- You want interpretable improvements (reflection explains changes)
Key Parameters:
auto="light"|"medium"|"heavy": Optimization intensityreflection_lm: LLM used for reflection (can differ from task LLM)reflection_minibatch_size=3: Examples per reflection roundskip_perfect_score=True: Don't reflect on successes
optimizer = dspy.GEPA(
metric=my_metric_with_feedback,
auto="medium",
reflection_lm=dspy.LM("openai/gpt-4o")
)
compiled = optimizer.compile(student=program, trainset=trainset, valset=valset)
8. BootstrapFinetune
Optimizes: Model Weights | Complexity: High | Data needed: 200+ examples
The Idea
BootstrapFinetune is different from all the others. Instead of changing prompts, it changes the model itself. It distills a prompt-optimized program into actual weight updates.
Step-by-Step
- Setup Teacher/Student:

  teacher_program = optimized_large_model_program  # e.g., GPT-4o
  student_program = program.deepcopy()
  student_program.set_lm(small_model)  # e.g., Llama-7B

- Bootstrap Training Data:

  training_data = []
  for example in trainset:
      # Run the teacher program
      output = teacher_program(**example.inputs())
      # Capture the full trace (input/output for each module)
      traces = get_all_module_traces()
      # Optionally filter with the metric
      if metric is None or metric(example, output) > threshold:
          training_data.append(traces)

- Create Multi-Task Dataset: Each module in the program becomes a separate "task":

  finetune_dataset = []
  for trace in training_data:
      for module_name, (input, output) in trace.items():
          finetune_dataset.append({
              "task": module_name,
              "input": input,
              "output": output,
          })

- Fine-tune the LM:

  # Using standard LM fine-tuning
  finetuned_lm = finetune(
      base_model=small_model,
      dataset=finetune_dataset,
      **train_kwargs,  # epochs, learning_rate, etc.
  )

- Update Program: Replace the LM in the student program with the fine-tuned version:

  student_program.set_lm(finetuned_lm)
  return student_program
Key Parameters:
- `metric`: Optional filter for training traces
- `num_threads`: Parallel processing for bootstrapping
- `train_kwargs`: Fine-tuning hyperparameters (epochs, lr, etc.)
When to Use:
- You've already optimized prompts with another optimizer
- You need to run at scale (smaller models = cheaper)
- Teacher model is expensive but available for data generation
Summary Comparison Table
| Optimizer | Optimizes | Method | Data | Best For |
|---|---|---|---|---|
| LabeledFewShot | Demos | Random selection | Any | Quick baseline |
| BootstrapFewShot | Demos | Metric-filtered | ~10 | Validated demos |
| BootstrapFewShotWithRandomSearch | Demos | Random search | ~50+ | Optimal demo sets |
| KNNFewShot | Demos | Nearest neighbor | Any | Input-dependent demos |
| COPRO | Instructions | Coordinate ascent | ~50+ | Zero-shot optimization |
| MIPROv2 | Both | Bayesian optimization | ~200+ | Comprehensive optimization |
| GEPA | Instructions | Reflective evolution | ~10+ | Sample-efficient, rich feedback |
| BootstrapFinetune | Weights | Gradient descent | ~200+ | Model distillation |
Recommended Workflow
- Start Simple: `LabeledFewShot` or `BootstrapFewShot` for a quick baseline
- Add Search: `BootstrapFewShotWithRandomSearch` if you have more data
- Optimize Instructions: `COPRO` if you want zero-shot prompting or clearer task descriptions
- Go Comprehensive: `MIPROv2` for strong results with sufficient data
- Reflect & Evolve: `GEPA` when you have rich feedback signals and want sample-efficient optimization
- Distill: `BootstrapFinetune` to compress the result into smaller, faster models
The key insight is that these optimizers can be composed—run MIPROv2 or GEPA, then pass the result to BootstrapFinetune for a highly optimized, efficient program.
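A hedged sketch of that composition; the model choice and `my_metric` are placeholders, not prescriptions:

import dspy
from dspy.teleprompt import MIPROv2, BootstrapFinetune

# Step 1: optimize prompts on a capable model; this becomes the teacher
teacher = MIPROv2(metric=my_metric, auto="medium").compile(
    program.deepcopy(), trainset=trainset
)

# Step 2: distill the optimized behavior into a smaller model's weights
student = program.deepcopy()
student.set_lm(dspy.LM("openai/gpt-4o-mini"))
efficient_program = BootstrapFinetune(metric=my_metric).compile(
    student, teacher=teacher, trainset=trainset
)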


