
DSPy Optimization Algorithms: Step-by-Step Guide

A detailed breakdown of how each DSPy optimizer works internally.

14 Jan, 2026 • Ish Rajesh Shelley • DSPy Optimization • 15 min read
tags: dspy, optimization, algorithms, ai, machine-learning

The Problem: Manual Prompt Engineering Doesn't Scale

You've written a prompt. It works... sometimes. So you tweak it. Test it. Tweak it again. Add "think step by step." Remove it. Add examples. Remove them. Try different phrasings.

After hours of this, you've maybe improved accuracy by 5%. And you have no idea if that improvement will hold on new data.

This is the prompt engineering treadmill. DSPy's optimizers are designed to get you off it.

What DSPy Optimizers Actually Do

Every DSPy optimizer answers the same question: "Given my task and some training data, what should my prompt look like?"

But they attack it from three different angles:

| Approach | What Gets Optimized | Optimizers |
| --- | --- | --- |
| Few-shot demos | "What examples should I show?" | LabeledFewShot, BootstrapFewShot, BootstrapFewShotWithRandomSearch, KNNFewShot |
| Instructions | "How should I describe the task?" | COPRO, MIPROv2, GEPA |
| Model weights | "Can I bake this into the model itself?" | BootstrapFinetune |
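
Whichever angle an optimizer takes, the interface is the same: construct it with a metric, then call compile() on your program with training data. Here is a minimal sketch of that shared pattern; your_program, trainset, and the exact_match metric are placeholders, not part of DSPy itself.

import dspy

# Placeholder metric: does the predicted answer match the gold answer?
def exact_match(example, pred, trace=None):
    return example.answer == pred.answer

# Every optimizer in this post follows the same construct-then-compile pattern.
optimizer = dspy.BootstrapFewShot(metric=exact_match)
compiled = optimizer.compile(student=your_program, trainset=trainset)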

Quick Reference: When to Use What

Before diving deep, here's the practical cheat sheet:

Few-Shot Demo Optimizers

  • LabeledFewShot - Simply grabs k random examples from training data. No validation, no magic.
  • BootstrapFewShot - Goes through each training input in the trainset, gets a prediction, and checks whether it "passes" the metric. If it passes, the captured trace is added to the compiled program as a demonstration. These traces include intermediate reasoning, not just inputs and outputs.
  • BootstrapFewShotWithRandomSearch - This creates many candidate programs by shuffling the training set and picking the best one. It tries different numbers of bootstrapped samples and evaluates each on the validation set.
  • KNNFewShot - Uses an in-memory KNN retriever to find the k nearest neighbors at test time. For each input, it identifies the most similar examples and attaches them as demonstrations.

Instruction Optimizers

  • COPRO - Uses coordinate ascent (hill-climbing) to optimize instructions. It generates multiple alternative instructions, evaluates them using your metric, and iteratively selects the best-performing ones.
  • MIPROv2 - Works in three stages: 1) Bootstrap few-shot examples. 2) Propose instruction candidates. 3) Find an optimized combination of both using Bayesian Optimization.
  • GEPA — Reflective evolution using natural language feedback. Analyzes WHY things failed, not just THAT they failed. Use when you need sample-efficient optimization or have rich feedback signals.

Weight Optimizer

  • BootstrapFinetune - A weight-optimizing variant that invokes a teacher program to produce reasoning, then uses those traces to fine-tune a student program.

1. LabeledFewShot

Optimizes: Few-shot Demonstrations • Complexity: Simple • Best For: Quick baseline, any data size

The Idea

LabeledFewShot is the simplest optimizer—it just grabs examples directly from your training data and prepends them to prompts.

Step-by-Step

  1. Initialize: Specify k (number of examples to include in each prompt)
  2. Random Selection: Randomly select k examples from your trainset
  3. Format Examples: Convert selected examples into the format expected by your module's signature
  4. Prepend to Prompt: These examples are prepended to every prompt sent to the LM

Key Characteristics:

  • No validation—examples are used as-is
  • Examples are "raw" (no intermediate reasoning steps)
  • Results may vary between runs due to random selection
  • No bootstrapping or augmentation

from dspy.teleprompt import LabeledFewShot

optimizer = LabeledFewShot(k=8)
compiled = optimizer.compile(student=your_program, trainset=trainset)

When It Works

Your examples are already high-quality. You need a baseline fast. You're testing whether few-shot helps at all before investing in fancier optimization.

When It Fails

Your training data has noise. Some examples are misleading. Different inputs need different types of examples.


2. BootstrapFewShot

Optimizes: Few-shot Demonstrations • Complexity: Moderate • Best For: ~10 examples, when you need validated demonstrations

The Idea

BootstrapFewShot uses a "teacher" program to generate high-quality demonstrations that pass a metric, then uses these to train a "student" program.

Step-by-Step

  1. Setup Teacher: If no teacher is provided, the student program itself becomes the teacher (optionally compiled with LabeledFewShot first)

  2. Enable Tracing: Turn on compile mode to track all internal Predict calls

  3. Iterate Through Training Examples:

for example in trainset:
    if we_found_enough_bootstrapped_demos():
        break

    # Run teacher on example
    prediction = teacher(**example.inputs())

    # Capture traces of all internal module calls
    predicted_traces = dspy.settings.trace
  4. Validate with Metric: For each prediction, check if it passes the metric:

    if metric(example, prediction, predicted_traces):
        # This is a valid demonstration!
        for predictor, inputs, outputs in predicted_traces:
            demo = dspy.Example(augmented=True, **inputs, **outputs)
            compiled_program[predictor_name].demos.append(demo)
    
  5. Collect Valid Demos: Only demonstrations that pass the metric are kept. This continues until max_bootstrapped_demos valid traces are collected.

  6. Combine with Labeled: The final prompt includes:

    • Up to max_labeled_demos raw examples from trainset
    • Up to max_bootstrapped_demos validated, augmented traces

Key Parameters:

  • max_bootstrapped_demos=4: Augmented traces to generate
  • max_labeled_demos=16: Raw demos from training set
  • max_rounds=1: Attempts per example
  • metric: Function to validate outputs

Why "Bootstrapped" Demos Are Special: Unlike labeled demos, bootstrapped demos include intermediate reasoning (chain-of-thought, retrieved context, etc.) because they capture the full trace of what the teacher did.


3. BootstrapFewShotWithRandomSearch

Optimizes: Few-shot Demonstrations • Complexity: Higher • Best For: 50+ examples, when you want to find optimal demo combinations

The Idea

This optimizer runs BootstrapFewShot multiple times with different random configurations and selects the best-performing program.

Step-by-Step

  1. Create Candidate Pool: Generate multiple candidate programs:

    • Uncompiled (baseline) program
    • LabeledFewShot optimized program
    • BootstrapFewShot with unshuffled examples
    • num_candidate_programs × BootstrapFewShot with randomized/shuffled examples
  2. For Each Candidate:

    for candidate_config in candidate_pool:
        # Shuffle training set differently each time
        shuffled_trainset = shuffle(trainset)
    
        # Vary number of bootstrapped demos
        num_demos = random_choice(range(1, max_bootstrapped_demos))
    
        # Compile with BootstrapFewShot
        candidate = BootstrapFewShot.compile(student, shuffled_trainset)
    
  3. Evaluate All Candidates: Run each candidate program on the validation set and compute metric scores

  4. Select Best: Return the program that achieved the highest score on the validation set

Key Parameters:

  • num_candidate_programs=10: Number of random configurations to try
  • All BootstrapFewShot parameters also apply

Why Random Search Helps: Different subsets of examples work better for different tasks. Random search explores this space efficiently without expensive exhaustive search.
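
In code, the call is nearly identical to BootstrapFewShot; this is a sketch with the same placeholders, and passing a separate valset keeps candidate selection from overfitting the trainset.

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimizer = BootstrapFewShotWithRandomSearch(
    metric=exact_match,
    max_bootstrapped_demos=4,
    max_labeled_demos=16,
    num_candidate_programs=10,   # random configurations to try
)
compiled = optimizer.compile(student=your_program, trainset=trainset, valset=valset)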


4. KNNFewShot

Optimizes: Few-shot Demonstrations (dynamically) • Complexity: Moderate • Best For: When different inputs benefit from different examples

The Idea

Instead of using fixed examples, KNNFewShot dynamically selects the most relevant examples for each input using semantic similarity.

Step-by-Step

  1. Initialize Embeddings:

    # Create embedder (e.g., using SentenceTransformers)
    embedder = Embedder(SentenceTransformer("all-MiniLM-L6-v2").encode)
    
    # Embed all training examples
    for example in trainset:
        example_vector = embedder.encode(example.inputs())
        store_in_index(example_vector, example)
    
  2. At Inference Time (for each new input):

a. Convert input to vector:
   query_vector = embedder.encode(new_input)

b. Find k nearest neighbors in the training set index

c. Retrieve the k most similar training examples

d. Use these as demonstrations for THIS specific query
  3. Dynamic Prompts: Each different input gets a customized prompt with the most relevant examples

Key Parameters:

  • k=3: Number of nearest neighbors to retrieve
  • trainset: Examples to search through
  • vectorizer: Embedding function

When to Use:

  • Your task has distinct "types" of inputs
  • Different categories benefit from different examples
  • One-size-fits-all demos don't work well
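
A usage sketch, mirroring the pattern in the DSPy docs; the embedding model is one reasonable choice, and exact argument names can shift between DSPy versions.

import dspy
from dspy.teleprompt import KNNFewShot
from sentence_transformers import SentenceTransformer

# Wrap a sentence-transformer encoder so DSPy can embed training inputs.
embedder = dspy.Embedder(SentenceTransformer("all-MiniLM-L6-v2").encode)

optimizer = KNNFewShot(k=3, trainset=trainset, vectorizer=embedder)
compiled = optimizer.compile(student=your_program)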

5. COPRO (Cooperative Prompt Optimization)

Optimizes: Instructions • Complexity: Moderate • Best For: Zero-shot optimization, when you need better task descriptions

How It Works

COPRO optimizes the instruction text (not examples) using coordinate ascent (hill-climbing).

Step-by-Step:

  1. Initialize: Start with the current instruction for each predictor in your program

  2. For Each Depth Iteration (outer loop):

    for d in range(depth):
        temperature = init_temperature * (0.9 ** d)  # Cool down over time
    
  3. Generate Candidates (inner loop for each predictor):

    for each predictor in program:
        # Use LM to propose new instructions
        for i in range(breadth):
            new_instruction = prompt_model.generate(
                context = [
                    "Previous instructions and scores",
                    "Propose a better instruction"
                ],
                temperature = temperature
            )
            candidates.append(new_instruction)
    
  4. Evaluate Each Candidate:

    for candidate_instruction in candidates:
        # Create program copy with this instruction
        test_program = program.copy()
        test_program.predictor.instruction = candidate_instruction
    
        # Evaluate on training set
        score = evaluate(test_program, trainset, metric)
    
  5. Select Best: Keep the instruction with the highest score

  6. Coordinate Ascent: Repeat the process, using the history of tried instructions to inform new proposals

Key Parameters:

  • breadth=10: Number of candidate instructions per iteration
  • depth=3: Number of refinement iterations
  • init_temperature=1.4: Controls creativity (higher = more diverse)
  • prompt_model: LM used to generate instruction candidates

The Meta-Prompt Used:

"You are an instruction optimizer for large language models.
I will give some task instructions I've tried, along with their
corresponding validation scores. The instructions are arranged
in increasing order based on their scores, where higher scores
indicate better quality.

Your task is to propose a new instruction that will lead a good
language model to perform the task even better. Don't be afraid
to be creative."
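
Putting it together, a hedged sketch: the prompt model and eval_kwargs values are assumptions, while exact_match and your_program are the placeholders from earlier.

import dspy
from dspy.teleprompt import COPRO

optimizer = COPRO(
    prompt_model=dspy.LM("openai/gpt-4o"),  # LM that proposes new instructions
    metric=exact_match,
    breadth=10,
    depth=3,
    init_temperature=1.4,
)
compiled = optimizer.compile(
    your_program,
    trainset=trainset,
    eval_kwargs={"num_threads": 8, "display_progress": True},
)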

6. MIPROv2 (Multiprompt Instruction Proposal Optimizer v2)

Optimizes: Instructions AND Demonstrations (jointly) • Complexity: High • Best For: 200+ examples, when you want the best possible optimization

How It Works

MIPROv2 is the comprehensive optimizer. It jointly optimizes both instructions AND demonstrations using three stages and Bayesian optimization.

Stage 1: Bootstrap Few-Shot Examples

for i in range(num_candidates):
    candidate_demos = []
    for example in random_sample(trainset):
        output = program(**example.inputs())
        if metric(example, output) > metric_threshold:
            candidate_demos.append(trace)
        if len(candidate_demos) >= max_bootstrapped_demos:
            break
    demo_sets[i] = candidate_demos

Stage 2: Propose Instruction Candidates (Grounded Proposal)

This stage is "data-aware and demonstration-aware":

# Gather context for instruction generation
context = {
    "dataset_summary": summarize_training_data(trainset),
    "program_code": inspect_dspy_program(program),
    "example_traces": bootstrapped_demos,
    "predictor_being_optimized": predictor_name
}

# Generate multiple instruction candidates
for i in range(num_candidates):
    instruction = prompt_model.generate(
        f"Given this context: {context}, write an effective instruction"
    )
    instruction_candidates.append(instruction)

Stage 3: Bayesian Optimization Search

for trial in range(num_trials):
    combo = bayesian_optimizer.suggest({
        "instruction_idx": range(num_candidates),
        "demo_set_idx": range(num_demo_sets)
    })

    test_program = build_program(
        instruction = instructions[combo.instruction_idx],
        demos = demo_sets[combo.demo_set_idx]
    )

    # Evaluate on minibatch
    score = evaluate(test_program, minibatch, metric)
    bayesian_optimizer.tell(combo, score)

Key Parameters:

  • num_candidates=7: Instruction/demo candidates to generate
  • num_trials=15-50: Bayesian optimization iterations
  • minibatch_size=25: Examples per trial evaluation
  • minibatch_full_eval_steps=10: How often to do full evaluation
  • auto="light"|"medium"|"heavy": Preset configurations

Why Bayesian Optimization?

| Method | Strategy | Efficiency |
| --- | --- | --- |
| Grid Search | Try every combination | Terrible for large spaces |
| Random Search | Try random combinations | Better, but blind |
| Bayesian | Learn from each trial, focus on promising areas | Smart |

7. GEPA (Genetic-Pareto Reflective Optimizer)

Optimizes: Instructions (with rich feedback) • Complexity: Sophisticated • Data needed: ~10+ examples (sample-efficient!)

The Idea

GEPA is fundamentally different from other optimizers. Instead of blind trial-and-error, it reflects on failures. It asks: "WHY did this fail? What specifically went wrong? How can we fix it?"

This is like the difference between a student who just retakes tests hoping for better luck, versus one who reviews their mistakes and learns from them.

The Core Innovation: Reflection + Pareto Frontier

GEPA maintains a Pareto frontier of candidates — prompts that each excel at something. Rather than converging on one "best" prompt, it keeps diverse specialists alive.

Each iteration:

  1. Sample a candidate from the Pareto frontier
  2. Run it on a minibatch, collect traces AND feedback
  3. Reflect: analyze what went wrong
  4. Mutate: propose an improved instruction
  5. Evaluate: if better, add to frontier

Step-by-Step

1. Sample from Pareto Frontier

# Candidates weighted by "coverage" - how many examples they're best on
candidate = weighted_sample(pareto_frontier)

2. Collect Traces + Feedback

for example in minibatch:
    output = candidate(**example.inputs())
    trace = capture_reasoning()
    feedback = metric.get_feedback(example, output)  # Rich text feedback!

3. Reflect on Failures

Here's the magic. GEPA sends this to a reflection LLM:

"Current instruction: [instruction]

Here's what happened:
- Example 1: FAILED. Feedback: 'Model didn't identify the variable
  to solve for, just computed 7+3 instead of recognizing x+3=7'
- Example 2: FAILED. Feedback: 'Multiplied instead of dividing'
- Example 3: SUCCESS.

Analyze what went wrong. Propose an improved instruction."

4. Mutate Based on Reflection

new_instruction = reflection_lm.generate(
    reflection_prompt + traces + feedback
)

5. Update Pareto Frontier

if new_candidate_is_non_dominated():
    pareto_frontier.add(new_candidate)
    remove_dominated_candidates()

Why GEPA Outperforms

| Aspect | Traditional Optimizers | GEPA |
| --- | --- | --- |
| Feedback | Scalar score (0.73) | Rich text ("failed because...") |
| Search | Random or gradient-based | Reflection-guided |
| Diversity | Converge to single best | Pareto frontier of specialists |
| Efficiency | Needs many rollouts | 35× fewer rollouts than GRPO |

Results from the paper:

  • Outperforms GRPO (reinforcement learning) by 10% on average, up to 20%
  • Uses up to 35× fewer rollouts
  • Outperforms MIPROv2 by 10%+ across benchmarks

When to Use GEPA

  • You have rich feedback signals (not just pass/fail)
  • Sample efficiency matters (limited budget)
  • Your task has diverse subtypes (Pareto frontier helps)
  • You want interpretable improvements (reflection explains changes)

Key Parameters:

  • auto="light"|"medium"|"heavy": Optimization intensity
  • reflection_lm: LLM used for reflection (can differ from task LLM)
  • reflection_minibatch_size=3: Examples per reflection round
  • skip_perfect_score=True: Don't reflect on successes
optimizer = dspy.GEPA(
    metric=my_metric_with_feedback,
    auto="medium",
    reflection_lm=dspy.LM("openai/gpt-4o")
)
compiled = optimizer.compile(student=program, trainset=trainset, valset=valset)
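
The metric passed to GEPA is where the rich feedback comes from. Current DSPy versions let a GEPA metric return a score plus textual feedback; the sketch below assumes that interface and a simple exact-match task, so treat the exact signature as version-dependent.

import dspy

def my_metric_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    score = float(gold.answer == pred.answer)
    if score == 1.0:
        feedback = "Correct."
    else:
        feedback = f"Expected '{gold.answer}' but got '{pred.answer}'. Check the reasoning steps."
    # GEPA reflects on this text, not just the scalar score.
    return dspy.Prediction(score=score, feedback=feedback)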

8. BootstrapFinetune

Optimizes: Model Weights • Complexity: High • Data needed: 200+ examples

The Idea

BootstrapFinetune is different from all the others. Instead of changing prompts, it changes the model itself. It distills a prompt-optimized program into actual weight updates.

Step-by-Step

  1. Setup Teacher/Student:

    teacher_program = optimized_large_model_program  # e.g., GPT-4o
    student_program = program.deepcopy()
    student_program.set_lm(small_model)  # e.g., Llama-7B
    
  2. Bootstrap Training Data:

    training_data = []
    
    for example in trainset:
        # Run teacher program
        output = teacher_program(**example.inputs())
    
        # Capture full trace (input/output for each module)
        traces = get_all_module_traces()
    
        # Optionally filter with metric
        if metric is None or metric(example, output) > threshold:
            training_data.append(traces)
    
  3. Create Multi-Task Dataset: Each module in the program becomes a separate "task":

    finetune_dataset = []
    
    for trace in training_data:
        for module_name, (input, output) in trace.items():
            finetune_dataset.append({
                "task": module_name,
                "input": input,
                "output": output
            })
    
  4. Fine-tune the LM:

    # Using standard LM fine-tuning
    finetuned_lm = finetune(
        base_model = small_model,
        dataset = finetune_dataset,
        **train_kwargs  # epochs, learning_rate, etc.
    )
    
  5. Update Program: Replace LM in student program with finetuned version:

    student_program.set_lm(finetuned_lm)
    return student_program
    

Key Parameters:

  • metric: Optional filter for training traces
  • num_threads: Parallel processing for bootstrapping
  • train_kwargs: Fine-tuning hyperparameters (epochs, lr, etc.)

When to Use:

  • You've already optimized prompts with another optimizer
  • You need to run at scale (smaller models = cheaper)
  • Teacher model is expensive but available for data generation
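
A sketch of the distillation call; compiled_teacher, the small model name, and exact_match are assumptions, and the available fine-tuning backends depend on the LM provider you configure.

import dspy
from dspy.teleprompt import BootstrapFinetune

# Student copies the program structure but points at a smaller model.
student_program = your_program.deepcopy()
student_program.set_lm(dspy.LM("openai/gpt-4o-mini"))

optimizer = BootstrapFinetune(metric=exact_match, num_threads=8)
finetuned = optimizer.compile(student_program, teacher=compiled_teacher, trainset=trainset)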

Summary Comparison Table

| Optimizer | Optimizes | Method | Data | Best For |
| --- | --- | --- | --- | --- |
| LabeledFewShot | Demos | Random selection | Any | Quick baseline |
| BootstrapFewShot | Demos | Metric-filtered | ~10 | Validated demos |
| BootstrapFewShotWithRandomSearch | Demos | Random search | ~50+ | Optimal demo sets |
| KNNFewShot | Demos | Nearest neighbor | Any | Input-dependent demos |
| COPRO | Instructions | Coordinate ascent | ~50+ | Zero-shot optimization |
| MIPROv2 | Both | Bayesian optimization | ~200+ | Comprehensive optimization |
| GEPA | Instructions | Reflective evolution | ~10+ | Sample-efficient, rich feedback |
| BootstrapFinetune | Weights | Gradient descent | ~200+ | Model distillation |

Recommended Workflow

  1. Start Simple: LabeledFewShot or BootstrapFewShot for quick baseline
  2. Add Search: BootstrapFewShotWithRandomSearch if you have more data
  3. Optimize Instructions: COPRO if you want zero-shot or clearer prompts
  4. Go Comprehensive: MIPROv2 for good results with sufficient data
  5. Reflect & Evolve: GEPA when you have rich feedback signals and want sample-efficient optimization
  6. Distill: BootstrapFinetune to compress into smaller, faster models

The key insight is that these optimizers can be composed—run MIPROv2 or GEPA, then pass the result to BootstrapFinetune for a highly optimized, efficient program.
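
For example, a hedged sketch of that composition, reusing names from the earlier sketches:

# 1) Prompt-level optimization with a strong model driving the program.
prompt_optimized = dspy.MIPROv2(metric=exact_match, auto="medium").compile(
    your_program, trainset=trainset
)

# 2) Distill the optimized behavior into a smaller, cheaper student.
distilled = dspy.BootstrapFinetune(metric=exact_match).compile(
    student_program, teacher=prompt_optimized, trainset=trainset
)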
