
Prompt Optimisation: An Introduction to DSPy

Written by Dean Kennedy | Oct 24, 2025 3:15:32 PM

What is Prompt Optimisation?

Prompt optimisation is the process of systematically improving the prompts given to a language model to achieve better results on a given task. Traditional prompt engineering relies on good intuition and techniques such as few-shot prompting, which can give it a "trial-and-error" feel.

Prompt optimisation techniques treat prompt design like a machine learning problem: given available data, we tune prompts to optimise performance metrics. This addresses a key issue with manual prompt engineering, which can be time consuming because many iterations may be required to reach the best performance. Furthermore, models and requirements change over time, so handcrafted prompts can break and lead to degraded performance.

Introduction to DSPy

DSPy (Declarative Self-improving Python) is an open-source framework from Stanford University that lets users programmatically define LLM pipelines instead of manually writing prompts. Tasks are defined declaratively by specifying input and output fields using a Signature. DSPy constructs the prompt under the hood using templated best practices (such as formatting and adding instructions) to generate consistent, instruction-following prompts. DSPy also provides Modules for techniques such as chain-of-thought and ReAct, which can be composed into pipelines whilst separating the pipeline logic from the prompt wording itself.
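
To make this concrete, the sketch below shows the declarative style using DSPy's built-in chain-of-thought module. It assumes a language model has already been configured (as in the Databricks setup later in this post); the signature string and example question are purely illustrative.

import dspy

# Assumes dspy.configure(lm=...) has already been called with a language model.
# The string signature "question -> answer" declares the input and output fields;
# DSPy builds the underlying prompt (instructions, formatting) from it.
cot = dspy.ChainOfThought("question -> answer")

# Invoking the module returns a Prediction carrying the declared output field,
# plus the intermediate reasoning produced by chain-of-thought prompting.
prediction = cot(question="What is prompt optimisation?")
print(prediction.answer)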

How DSPy Optimises Prompts

DSPy optimisers take three things as input:

  • DSPy program: the pipeline that solves a problem (question answering, classification, etc.).
  • Metric: a function that evaluates the program’s output for a given input and returns a usable score, such as accuracy, F1, or even RAG-based metrics.
  • Training examples: a small set of validated input/output pairs (see the sketch below). Even datasets of 5-20 entries can be enough for certain optimisations.
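
The program and metric for our worked example appear later in this post. As a rough sketch of the third input, here is how a small training set can be built with dspy.Example; the questions and labels below are invented for illustration.

import dspy

# Each Example holds the input question and its validated label;
# with_inputs() marks which fields are inputs (the rest are treated as targets).
trainset = [
    dspy.Example(question="What does NASA stand for?", label="ABBR").with_inputs("question"),
    dspy.Example(question="Who painted the Mona Lisa?", label="HUM").with_inputs("question"),
    dspy.Example(question="Where is the Eiffel Tower located?", label="LOC").with_inputs("question"),
]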

Given these inputs, users can apply different optimisers to search for prompt improvements. These broadly fall into three strategies:

  • Generating few-shot examples: Bootstrap methods create examples to be included in the prompt, typically by running the program on training data, collecting output traces, and keeping the highest-quality input/output pairs (sketched below).
  • Instructions and wording: Algorithms such as GEPA and MIPROv2 propose new instructions or phrasing. They typically use search algorithms that mutate candidate prompts via LLM-based reflection and use task-specific contextual feedback to guide the search.
  • Fine-tuning: Optimisers can also adjust model weights based on data collected via prompting.

Importantly, these methods are not mutually exclusive and can be combined and adapted to any LM workflow.
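
To illustrate the first strategy (and the BootstrapFewShot row in the results table later), here is a minimal sketch of a bootstrap optimiser. It assumes the TrecClassifier program, accuracy_metric, and trainset defined later in this post; the demo counts are illustrative.

from dspy.teleprompt import BootstrapFewShot

# Run the student program over the training set, keep traces that score well
# under the metric, and compile them into the prompt as few-shot demonstrations.
bootstrap = BootstrapFewShot(
    metric=accuracy_metric,    # defined later in this post
    max_bootstrapped_demos=4,  # demos generated from model traces
    max_labeled_demos=4,       # demos taken directly from the training set
)
bootstrapped_program = bootstrap.compile(TrecClassifier(), trainset=trainset)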

Code Example

Below is an example of implementing DSPy on Databricks using Meta’s Llama-3.1-8B-Instruct. We use the TREC-6 dataset (six-way question classification).

Module setup

# Configure DSPy to use the Llama 3.1 8B Instruct endpoint served on Databricks.
import dspy

lm = dspy.LM(model='databricks/databricks-meta-llama-3-1-8b-instruct')
dspy.configure(lm=lm)

Signature

Then we define our DSPy signature, which declares the intended behaviour of the module: it tells the language model what to do, rather than how to do it.

class TrecClassify(dspy.Signature):
    """Classify the question into exactly one of the labels.
    Choose from: ABBR, DESC, ENTY, HUM, LOC, NUM.
    Respond with ONLY the label string (ABBR/DESC/ENTY/HUM/LOC/NUM), nothing else.
    """
    question: str = dspy.InputField()
    label: str = dspy.OutputField(desc="One of ABBR, DESC, ENTY, HUM, LOC, NUM")

Module

Then, we define our DSPy module. The prompts (and parameters) of the module are tuned by the optimiser. The module is invoked to process inputs and return outputs.

class TrecClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        # dspy.Predict builds the prompt from the signature; its instructions and
        # demonstrations are what the optimiser tunes.
        self.classify = dspy.Predict(TrecClassify)

    def forward(self, question: str):
        return self.classify(question=question)
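
Before any optimisation, the module can be called directly; the question below is just for illustration.

classifier = TrecClassifier()

# Calling the module runs forward() and returns a Prediction with the declared output field.
prediction = classifier(question="Who wrote Hamlet?")
print(prediction.label)  # expected to be one of the six labels, e.g. "HUM"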

Metric

Here we define our simple metric, accuracy, whose score is used to tune the module. DSPy optimisers also pass a trace argument to the metric, so we accept it with a default of None.

def accuracy_metric(example, prediction, trace=None):
    # Return 1.0 when the predicted label exactly matches the gold label, else 0.0.
    return 1.0 if example.label == prediction.label else 0.0

Optimiser

Now, we define our optimiser using our metric. We use MIPROv2 (Multiprompt Instruction Proposal Optimiser), which optimises both the instructions and the few-shot examples. We then compile the program and evaluate it against a zero-shot baseline.

from dspy.teleprompt import MIPROv2

# Propose and search over candidate instructions and few-shot demonstrations.
opt = MIPROv2(metric=accuracy_metric, num_candidates=32)

compiled = opt.compile(
    student=TrecClassifier(),
    trainset=trainset,
    valset=devset[:150],
    num_trials=48,
)
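
The compile step returns an optimised program; a minimal way to score it against the zero-shot baseline is DSPy's built-in Evaluate utility, sketched below. This assumes a held-out testset of dspy.Example objects; the num_threads value is illustrative.

from dspy.evaluate import Evaluate

# Score a program over a held-out test set with the same accuracy metric.
evaluator = Evaluate(devset=testset, metric=accuracy_metric, num_threads=8, display_progress=True)

baseline_score = evaluator(TrecClassifier())  # zero-shot baseline
compiled_score = evaluator(compiled)          # MIPROv2-optimised program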

Results

Method             | Accuracy | Difference to Baseline
Zero-shot          | 0.202    | -
BootstrapFewShot   | 0.165    | -0.037
MIPROv2            | 0.367    | +0.165

This basic pipeline resulted in an 82% relative improvement over the zero-shot baseline ((0.367 - 0.202) / 0.202 ≈ 0.82). For more complex problems requiring a higher level of domain knowledge, it is easy to see how optimising the prompt could lead to vastly improved results. A bootstrap optimiser (BootstrapFewShot) was also used, though it scored slightly below the baseline, likely due to low-quality demonstrations or a biased training split.

Pros & Cons of Prompt Optimisation

Prompt optimisation offers several benefits over standard prompt engineering. As shown in the small example above, a simple training pipeline delivered a clear improvement over a standard prompt. With high-quality few-shot examples and tuned instructions, the benefits only grow as task complexity increases.

A major benefit is the reduction in labour-intensive, trial-and-error prompt engineering. What was once a repetitive guessing game is automated in a systematic way that improves results and, importantly, frees developers to focus on other design aspects such as the solution architecture.

As prompt optimisation libraries such as DSPy use a systematic and data-driven approach, prompts can be rigorously evaluated and different methodologies compared. This is vital for deploying GenAI solutions through A/B tests. Furthermore, as requirements and models change over time, a DSPy program can be re-optimised easily to incorporate the new requirements or factor in the new model behaviour.

Of course, there are drawbacks to prompt optimisation techniques. Firstly, they introduce a larger computational overhead in the form of model API calls, which increases cost and training time. Returns can also diminish for small datasets: whilst prompt optimisation is effective with few examples, poorly chosen metrics or biased data can cause the optimiser to overfit. Crucially, DSPy in particular can feel opaque because prompts are hidden behind abstractions; it saves you from iterating on prompt wording, but you also give up some control. DSPy does allow insight into the optimised prompts, though it is not trivial.
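
For reference, a minimal sketch of two ways to look under the hood follows; the file name is illustrative, and older DSPy versions expose prompt history on the LM object rather than at the top level.

# Print the most recent prompt(s) actually sent to the language model.
dspy.inspect_history(n=1)

# Persist the optimised program (instructions and demonstrations) for inspection or reuse.
compiled.save("trec_classifier_optimised.json")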

Conclusion

Prompt optimisation libraries such as DSPy turn prompt engineering from an art into more of a science. DSPy does this by providing a rich framework to programmatically define LLM behaviour and systematically improve it through a data-driven approach. Treating prompts as tuneable parameters can greatly improve performance whilst reducing the time spent developing and maintaining prompts. As GenAI applications evolve, frameworks like DSPy could make prompt optimisation as routine as hyperparameter tuning in traditional ML.

That said, prompt optimisation must be weighed against the increased compute overhead and cost, the effort of learning a new framework, and the possibility of only small gains for the additional time spent. Overall, for complicated tasks that require a high level of domain knowledge, prompt optimisation could be a viable solution.

Exploring prompt optimisation for your LLM projects? Our team helps organisations implement systematic approaches to GenAI development, from framework selection to production deployment. Get in touch to discuss your use case.