A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis
Teo Susnjak

TL;DR
This paper introduces a reproducible calibration workflow for prompt-based large language models in evidence synthesis, emphasising transparency, transferability, and systematic optimisation.
Contribution
It presents a structured, metric-guided prompt calibration protocol that separates the task rules from the prompt framing. The protocol is validated on title and abstract screening, with example code built on DSPy and GEPA.
Findings
Calibration workflow improves prompt performance on screening tasks
Using a smaller student LLM with a larger reflection LLM enhances optimisation
Artefact preservation facilitates reproducibility and transferability
Abstract
This methods article presents a reproducible calibration workflow for prompt-based large language models (LLMs) in structured evidence-synthesis tasks. The method separates the rules that define the scientific task from the mutable prompt harness that frames and applies them. It optimises that harness against labelled or reference examples and an explicit task metric, then preserves the calibrated workflow as an inspectable artefact with its specification, metric, settings, and evaluation traces. The example code instantiates the protocol with DSPy and GEPA tools, but the underlying logic can transfer to other prompt-optimisation frameworks that support structured task definitions, metric-guided search, and artefact reuse. Title and abstract screening is the worked validation case because it provides labelled benchmark data and clear evaluation metrics. The demonstrated workflow uses a…
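The protocol described in the abstract (fixed task rules, a mutable prompt harness, metric-guided search over labelled examples, and a preserved artefact) can be illustrated with a minimal sketch. This is not the paper's DSPy/GEPA code. It uses plain Python, a deterministic `stub_model` in place of an LLM call, an exhaustive search over two hand-written harness candidates in place of GEPA's reflective search, and recall on included records as an assumed metric. All names, criteria, and example records are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    record: str
    label: str  # "include" or "exclude"


# Fixed task rules (hypothetical eligibility criterion); the search never edits these.
TASK_RULES = "Include a record only if it reports a randomised trial in adults."

# Mutable harness candidates: alternative framings around the same rules.
HARNESS_CANDIDATES = [
    "Task rules: {rules}\nAnswer include or exclude.\nRecord: {record}",
    "Task rules: {rules}\nWhen unsure, answer include.\nRecord: {record}",
]

# Hypothetical labelled examples standing in for a screening benchmark.
LABELLED = [
    Example("A randomised trial of drug X in adults.", "include"),
    Example("A controlled trial with random allocation of adults.", "include"),
    Example("A narrative review of drug X.", "exclude"),
    Example("A cohort study in children.", "exclude"),
]


def stub_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM call (assumption, not a real model)."""
    record = prompt.rsplit("Record: ", 1)[1].lower()
    if "randomised" in record:
        return "include"
    # The stub only gives borderline records the benefit of the doubt
    # when the harness tells it to.
    if "when unsure" in prompt.lower() and "trial" in record:
        return "include"
    return "exclude"


def recall_on_includes(predictions, examples) -> float:
    """Explicit task metric: share of truly relevant records that were included."""
    relevant = [p for p, ex in zip(predictions, examples) if ex.label == "include"]
    if not relevant:
        return 0.0
    return sum(p == "include" for p in relevant) / len(relevant)


def calibrate(candidates, examples):
    """Score every harness against the metric and keep the best as an artefact."""
    best = None
    for harness in candidates:
        traces = []
        for ex in examples:
            prompt = harness.format(rules=TASK_RULES, record=ex.record)
            traces.append(
                {"record": ex.record, "label": ex.label, "prediction": stub_model(prompt)}
            )
        score = recall_on_includes([t["prediction"] for t in traces], examples)
        if best is None or score > best["score"]:
            best = {
                "task_rules": TASK_RULES,
                "harness": harness,
                "metric": "recall_on_includes",
                "score": score,
                "settings": {"search": "exhaustive", "n_candidates": len(candidates)},
                "traces": traces,
            }
    return best


artefact = calibrate(HARNESS_CANDIDATES, LABELLED)
# Serialising the artefact keeps specification, metric, settings, and traces inspectable.
artefact_json = json.dumps(artefact, indent=2)
print(artefact["harness"])
print(artefact["score"])
```

The sketch keeps `TASK_RULES` outside the search space so that only the framing changes between candidates. The saved artefact records what was optimised, against which metric, and with what evidence.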
