Compiling Prompts, Not Crafting Them: A Reproducible Workflow for AI-Assisted Evidence Synthesis

Teo Susnjak

arXiv:2509.00038·cs.CL·September 3, 2025

TL;DR

This paper introduces a reproducible, structured workflow for AI-assisted systematic literature reviews that replaces manual prompt crafting with automated prompt optimization, enhancing reliability and transparency.

Contribution

It adapts declarative prompt optimisation techniques for SLR automation, providing a domain-specific framework with code for verifiable, transparent LLM pipelines.

Findings

01 Demonstrates applicability of prompt optimisation to SLR
02 Provides a reproducible blueprint with code examples
03 Enhances transparency and reliability in evidence synthesis

Table 1: Examples of LLM prompt-induced performance swings in SLR tasks

SLR Phase & Study	Prompt Engineering Fragility Example
Screening
Shah et al. [8]	Accuracy fluctuated by up to 28.3% across prompts
Dennstädt et al. [9]	Sensitivity/specificity swung from 94.5%/31.8% (Flan-T5) to 81.9%/75.2% (Mixtral) across LLM families and prompts.
Cao et al. [10]	Reordering or omitting few-shot exemplars altered include/exclude decisions for 15% of abstracts.
Trad et al. [11]	Tightening exclusion thresholds in prompts dropped abstract-error-rate (AER) from 72.1% to 50.7%.
Cao et al. [12]	Zero-shot prompting achieved sensitivity 16.7–87.5%, vs. optimised prompt 86.7–100.0%.
Data Extraction
Cao et al. [13]	Accuracy swung by 15 percentage points across different prompts.
Lai et al. [14]	LLM-assisted extraction rose from 95.1% to 97.9%, driven by optimised prompts
Li et al. [4]	Recall for extracting study details varied between 64% and 92% across different LLMs and prompting strategies.
Khraisha et al. [15]	Accuracy spanned 60% to near-perfect levels (>95%) depending on prompts
Risk of Bias / Quality Assessment
Wang et al. [16]	Agreement with clinical guidelines ranged widely, from kappa = –0.002 to 0.984, under different prompting styles, compromising compliance judgments.
Lai et al. [17]	RoB 1 accuracy shifted from 84.5% to 89.5% across different LLMs and prompt configurations; domain accuracies swung 56.7–98.3%.
Lai et al. [14]	LLM-only RoB 1 accuracy was 95.7–96.9%; domain scores varied from 87.9% to 100%.
Eisele-Metzger et al. [18]	Overall RoB2 LLM agreement was 41% (kappa=0.22; domain kappa range 0.10–0.31), varying with prompt phrasing and domain focus.

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Meta-analysis and systematic reviews

Full text

Compiling Prompts, Not Crafting Them: A Reproducible Workflow for AI-Assisted Evidence Synthesis

Teo Susnjak

School of Mathematical and Computational Sciences

Massey University

Albany, New Zealand

Corresponding author: [email protected]

Abstract

Large language models (LLMs) offer significant potential to accelerate systematic literature reviews (SLRs), yet current approaches often rely on brittle, manually crafted prompts that compromise reliability and reproducibility. This fragility undermines scientific confidence in LLM-assisted evidence synthesis. In response, this work adapts recent advances in declarative prompt optimisation, developed for general-purpose LLM applications, and demonstrates their applicability to the domain of SLR automation. This research proposes a structured, domain-specific framework that embeds task declarations, test suites, and automated prompt tuning into a reproducible SLR workflow. These emerging methods are translated into a concrete blueprint with working code examples, enabling researchers to construct verifiable LLM pipelines that align with established principles of transparency and rigour in evidence synthesis. This is a novel application of such approaches to SLR pipelines.

Keywords: Systematic Literature Review Automation, Evidence Synthesis, Large Language Models, Reproducibility, Prompt Engineering, Context Engineering, Prompt Optimisation, Prompt Compilation, AI-driven Research Automation

What is already known

• Systematic literature reviews (SLRs) are foundational for evidence-based practice but are notoriously slow and resource-intensive.
• Large language models (LLMs) show significant promise for automating SLR tasks, but their performance is highly sensitive to the phrasing of input prompts.
• This "prompt fragility" makes LLM-assisted workflows unreliable and difficult to reproduce, and raises concerns about their scientific validity.

What is new

• This paper introduces a declarative framework that adapts state-of-the-art prompt optimisation techniques from the general AI field and applies them specifically to the domain of SLR automation.
• It replaces manual, ad hoc "prompt alchemy" with a rigorous, four-step programmatic process: (1) formally defining the research goal, (2) codifying the quality standard with data, (3) automatically compiling an optimal prompt, and (4) packaging the result as a verifiable digital artefact.
• The study provides both a conceptual blueprint and a functional code example, demonstrating a practical pathway to building robust and reproducible LLM pipelines for evidence synthesis.

Potential impact for research automation and synthesis researchers

• This work provides researchers and methodologists with a clear, actionable methodology to harness the speed of LLMs without sacrificing the scientific rigour and reproducibility that are cornerstones of evidence synthesis.
• The proposed framework offers a path toward establishing new standards for transparency and auditability in AI-assisted reviews, allowing others to verify and replicate automated steps with precision.
• It lays the groundwork for a future ecosystem of modular, verifiable, and reusable AI components for all stages of an SLR, empowering the community to build more trustworthy and efficient tools for research synthesis.

1 Introduction

Large language models (LLMs) now offer a promising path for automating systematic literature reviews (SLRs) [1]. Recent studies show that models have considerable potential to automate all phases of the SLR process, from abstract screening, data extraction, and quality assessment through to evidence synthesis, with promising accuracy [1, 2]. Despite this progress, serious concerns persist. LLM outputs can change significantly with small variations in prompts [3]. Model updates can invalidate previously crafted prompts and break reliable pipelines, while behaviour and accuracy diverge significantly across models given identical prompts [4]. These sensitivities and fragilities erode trust and widen the reproducibility gap in SLRs and in scientific endeavours more broadly. There is therefore a need to explore more systematic approaches to SLR automation that ensure reliability and repeatability when employing LLMs for evidence synthesis.

2 Prompt Engineering Crisis

The remarkable reasoning abilities of LLMs have led to natural language (NL) being framed as a new programming language [5], implying that NL has become the primary interface for instructing AI systems, with generative models acting as compilers that translate these instructions into correct actions or executable code. However, this framing ignores the consequences of NL ambiguity: natural languages lack the rigidity and precision of formally specified programming languages. Humans rely on context to resolve meaning, whereas programming removes ambiguity and assumptions through precise instructions. Treating LLM prompts as code collapses this difference. Since LLMs are not guaranteed to produce the same outputs for identical inputs, prompt engineering has in practice turned into prompt alchemy. LLM outputs are highly non-deterministic, influenced by ad hoc prompt design choices, the sequencing of instructions, slight rephrasings, and luck [6]. Subtle variations in prompt format can result in differences as large as 76 accuracy points [3] on tasks from the Super-NaturalInstructions benchmark. Evidence of this brittleness is also growing in the SLR automation field [7, 1]. While studies identify LLM fragilities, none propose a reproducible, programmatic remedy offering deterministic and more accurate outputs. Table 1 summarises recent works using LLMs to automate different SLR phases, revealing their variability.

3 A Declarative Framework for Reliable and Reproducible SLR Automation

The brittleness of prompt engineering requires a paradigm shift from ad hoc design to programmatic rigour. This work proposes the use of declarative prompt tuning for future SLR automation research, inspired by recent advances in prompt optimisation such as the DSPy [19], GRPO [20] and GEPA [21] frameworks. It advocates a return to a programmatic, and specifically declarative, paradigm that aims to restore reproducibility in LLM-driven SLR research by decoupling the researcher’s scientific intent (the “what”) from the model’s specific implementation (the “how”). Instead of relying on fragile, hand-crafted prompts, this approach treats LLM workflows for SLR tasks as language model (LM) programs that can be compiled. Compilation is an automated process that systematically searches for a high-performing, LLM-agnostic prompt configuration that satisfies a predefined quality standard or accuracy requirement. The methodology is operationalised through four key components, depicted and described in Figure 1, which apply to all stages of an SLR process. As a concrete illustration, the framework is translated into an example for abstract screening below:

Box 1: Blueprint for a Verifiable Screening Module via Declarative LM-program Tuning

Abstract Screening Example

1. Define the Goal: Screening abstracts for inclusion.

• Task Declaration: Inputs follow a fixed schema {title, abstract, keywords}. The label space is {Include, Exclude, Unsure}. Unsure is treated as Include for safety or routed to human review, and this policy is fixed in the spec.
• Context Engineering: A versioned context file states the PICO criteria, study designs in scope, and the review questions.

2. Codify the Standard: Define a machine-testable target.

• Gold-Standard Examples: Curate $N$ expert-labeled abstracts that are not part of the study, representing each possible classification outcome.
• Evaluation Metric: The primary metric is accuracy.

3. Compile the Program: Run a controlled search over prompts and exemplars.

• The compiler explores instruction templates and up to $k$ few-shot exemplars under a pinned model build with temperature=0, a fixed seed, and a set budget of $B$ evaluations. All runs log hashes of data, prompts, model ID, and decoding parameters.

4. Package the Artefact: Emit a shareable, auditable bundle.

• The bundle contains config.yaml (task spec and run controls), prompt.txt and exemplars.json, metrics.json with the test-set results, and a run log. A short mapping lists which PRISMA items the bundle supports, such as protocol transparency and decision traceability. Results are verifiable and recomputable under a pinned environment.
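Steps 1 and 4 of the blueprint can be sketched with the Python standard library alone. The file names follow Box 1, but everything else here is an assumption made for illustration: the field names inside `CONFIG`, the model identifier string, the placeholder metric value, and the extra `manifest.json` file are hypothetical and are not a format prescribed by the paper. The run log mentioned in Box 1 is omitted for brevity.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical task spec and run controls; all field names are illustrative.
CONFIG = {
    "task": {
        "inputs": ["title", "abstract", "keywords"],
        "labels": ["Include", "Exclude", "Unsure"],
        "unsure_policy": "treat_as_include",
    },
    "run": {
        "model_id": "hypothetical-model-build",
        "temperature": 0,
        "seed": 0,
        "budget": 8,
        "max_exemplars": 2,
    },
}


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def package_bundle(out_dir, config, prompt, exemplars, metrics):
    """Write the bundle files plus a manifest of their hashes."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    files = {
        # YAML 1.2 is designed as a superset of JSON, so the json module is
        # used here to avoid a third-party YAML dependency in this sketch.
        "config.yaml": json.dumps(config, indent=2, sort_keys=True),
        "prompt.txt": prompt,
        "exemplars.json": json.dumps(exemplars, indent=2),
        "metrics.json": json.dumps(metrics, indent=2),
    }
    for name, text in files.items():
        (out / name).write_text(text, encoding="utf-8")
    manifest = {name: sha256_of(out / name) for name in files}
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


def verify_bundle(out_dir):
    """Recompute every hash and compare it with the stored manifest."""
    out = Path(out_dir)
    manifest = json.loads((out / "manifest.json").read_text(encoding="utf-8"))
    return all(sha256_of(out / name) == digest for name, digest in manifest.items())


with tempfile.TemporaryDirectory() as tmp:
    manifest = package_bundle(
        tmp,
        CONFIG,
        prompt="Screen the abstract against the review criteria.",
        exemplars=[{"title": "Example trial", "label": "Include"}],
        metrics={"accuracy": 1.0},  # placeholder value, not a reported result
    )
    ok_before = verify_bundle(tmp)
    # Editing any bundled file after packaging is detected on verification.
    (Path(tmp) / "prompt.txt").write_text("edited prompt", encoding="utf-8")
    ok_after = verify_bundle(tmp)

print(ok_before, ok_after)
```

Storing content hashes next to the files is one simple way to make a bundle tamper-evident; a real pipeline would likely also record the environment and dataset versions named in the spec.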

The conceptual blueprint detailed in Box 1 can be directly implemented using modern programmatic LLM frameworks. To demonstrate how this can be operationalised, a functional Python implementation of the abstract screening module illustrated above, using the DSPy library with its MIPROv2 optimiser, is provided in Appendix A. A MIPROv2 example and an equivalent example using the latest GEPA implementation are made available on Google Colab. The code example in Appendix A demonstrates a declarative goal definition, a gold-standard dataset and a metric function that codify the standard, an optimiser that compiles the LM program, and how the verifiable artefact is packaged and used to classify new abstracts. This automated compilation process is analogous to hyperparameter tuning in machine learning, in that it uses a validation dataset and a metric to systematically search for a near-optimal configuration for a given model and dataset prior to use. However, instead of tuning architectural parameters, this framework tunes NL artefacts, namely the instructions and few-shot examples that guide a fixed, pre-trained model. This positions the method as a rigorous, data-driven alternative to manual prompt engineering, meeting the high scientific standards required for evidence synthesis. This probing study calls on researchers in the SLR automation field to empirically investigate this approach and the available supporting framework implementations, which offer a tangible pathway for adopting a rigorous and reproducible methodology for fully automating SLRs in a PRISMA-compliant manner.
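The analogy with hyperparameter tuning can be made concrete with a minimal, library-free sketch. It is not the MIPROv2 or GEPA algorithm: it is a plain enumeration of instruction and exemplar configurations under an evaluation budget, scored by accuracy on a small gold set. The names `toy_model`, `INSTRUCTIONS`, `EXEMPLARS`, and `GOLD`, and all of the example abstracts, are hypothetical stand-ins for a pinned LLM and expert-labelled data.

```python
from itertools import combinations, islice

# Hypothetical gold-standard abstracts with expert labels (illustrative only).
GOLD = [
    ("Randomised controlled trial of drug X in adults", "Include"),
    ("Randomised trial of exercise therapy in adults", "Include"),
    ("Narrative review of drug X pharmacology", "Exclude"),
    ("Case report of a rare adverse event", "Exclude"),
]

# Hypothetical candidate instructions the search may choose from.
INSTRUCTIONS = [
    "Decide if the abstract is relevant.",
    "Include only randomised trials; exclude reviews and case reports.",
]

# Hypothetical pool of few-shot exemplars.
EXEMPLARS = [
    ("Randomised trial of statins", "Include"),
    ("Systematic review of statins", "Exclude"),
]


def toy_model(instruction, exemplars, abstract):
    """Deterministic stand-in for a pinned LLM call at temperature=0.

    It applies the trial criterion only when the instruction states it, which
    mimics how prompt wording changes behaviour. Exemplars are accepted but
    ignored by this toy.
    """
    if "randomised" in instruction.lower():
        return "Include" if "randomised" in abstract.lower() else "Exclude"
    return "Include"  # a vague instruction leads to over-inclusion


def accuracy(instruction, exemplars):
    """Fraction of gold abstracts classified correctly by a configuration."""
    hits = sum(toy_model(instruction, exemplars, a) == y for a, y in GOLD)
    return hits / len(GOLD)


def compile_prompt(budget, k):
    """Score at most `budget` configurations using up to `k` exemplars each."""
    configs = (
        (ins, subset)
        for ins in INSTRUCTIONS
        for size in range(k + 1)
        for subset in combinations(EXEMPLARS, size)
    )
    scored = [(accuracy(ins, ex), ins, ex) for ins, ex in islice(configs, budget)]
    # max() keeps the first best configuration, so ties resolve deterministically.
    return max(scored, key=lambda t: t[0])


best_score, best_instruction, best_exemplars = compile_prompt(budget=8, k=2)
print(best_score, best_instruction)
```

With a smaller budget the search sees fewer configurations and may return a weaker prompt, which is why the budget belongs in the recorded run controls alongside the model build and seed.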

4 Conclusion

LLMs offer a compelling opportunity to streamline the labour-intensive processes involved in systematic literature reviews, but the current reliance on brittle prompt engineering introduces unacceptable risks to reproducibility and scientific rigour. This exploratory study proposes an alternative: a declarative framework for SLR automation that adapts recent advances in prompt optimisation into a structured, testable, and version-controlled workflow. The contribution lies in adapting declarative prompt optimisation techniques, originally developed for general LLM tasks, to the domain of SLR automation, and demonstrating their utility through a proof-of-concept, reproducible implementation. To the best of current knowledge, this represents the first application of such declarative techniques to evidence synthesis workflows. This prototyping work provides both a conceptual blueprint and a working implementation, illustrating how LLM-assisted evidence synthesis can evolve from fragile, ad hoc prompting into verifiable, auditable, and modular pipelines. Future work should fully test and expand this approach to other SLR stages, integrate it into standard reporting frameworks, and foster a community-driven ecosystem of reusable components for transparent, AI-enabled reviews.

Appendix A: Technical Example of a Screening Module for Prompt Optimisation

The conceptual blueprint described in Box 1 can be implemented using programmatic LLM tools such as DSPy. The following Python code provides a minimal functional example of how an AbstractScreening module would be structured, tested, and compiled into a verifiable, reproducible digital artefact, translating the declarative framework to the corresponding components of the underlying tool. A more complete version of the code below can be found in a Google Colab notebook.
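As a library-free illustration of the same structure, and not the DSPy listing itself, the sketch below declares the screening inputs, renders a prompt from an instruction and exemplars, parses the model's answer into the fixed label space, and scores predictions with an accuracy metric. The `Record` class, the prompt layout, the `fake_lm` callable, and the example abstracts are all hypothetical; in practice `lm` would wrap a pinned model call, and the instruction and exemplars would come from the compilation step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

LABELS = ("Include", "Exclude", "Unsure")


@dataclass(frozen=True)
class Record:
    """Fixed input schema from the task declaration."""
    title: str
    abstract: str
    keywords: str


def _block(record: Record) -> str:
    """Render one record in the (hypothetical) prompt layout."""
    return (
        f"Title: {record.title}\n"
        f"Abstract: {record.abstract}\n"
        f"Keywords: {record.keywords}\n"
        "Decision:"
    )


@dataclass
class AbstractScreening:
    """Hypothetical screening module: an LM callable, an instruction, exemplars."""
    lm: Callable[[str], str]
    instruction: str
    exemplars: List[Tuple[Record, str]] = field(default_factory=list)

    def render(self, record: Record) -> str:
        parts = [self.instruction, f"Answer with one of: {', '.join(LABELS)}."]
        parts += [f"{_block(ex)} {label}" for ex, label in self.exemplars]
        parts.append(_block(record))
        return "\n\n".join(parts)

    def __call__(self, record: Record) -> str:
        raw = self.lm(self.render(record)).strip().capitalize()
        # Anything outside the label space is routed to Unsure rather than guessed.
        return raw if raw in LABELS else "Unsure"


def fake_lm(prompt: str) -> str:
    """Deterministic stand-in for a model call; reads only the final record."""
    target = prompt.rsplit("Title:", 1)[-1].lower()
    if "randomised" in target:
        return "include"
    if "review" in target:
        return "exclude"
    return "not sure"


def accuracy(module, gold) -> float:
    """Metric function: fraction of gold records labelled correctly."""
    return sum(module(r) == y for r, y in gold) / len(gold)


gold = [
    (Record("Drug X trial", "A randomised controlled trial in adults.", "RCT"), "Include"),
    (Record("Drug X overview", "A narrative review of pharmacology.", "review"), "Exclude"),
    (Record("Protocol note", "Methods are not reported.", ""), "Unsure"),
]
screen = AbstractScreening(lm=fake_lm, instruction="Screen the abstract against the criteria.")
print(accuracy(screen, gold))
```

Passing the model in as a callable keeps the module testable without network access, and lets the same declaration be re-evaluated against a different model build.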

Bibliography

[1] Judith-Lisa Lieberum, Markus Töws, Maria-Inti Metzendorf, Felix Heilmeyer, Waldemar Siemens, Christian Haverkamp, Daniel Böhringer, Joerg J. Meerpohl, and Angelika Eisele-Metzger. Large language models for conducting systematic reviews: on the rise, but not yet ready for use—a scoping review. Journal of Clinical Epidemiology, 181:111746, 2025. doi: 10.1016/j.jclinepi.2025.111746.
[2] Teo Susnjak, Peter Hwang, Napoleon Reyes, Andre L. C. Barczak, Timothy McIntosh, and Surangika Ranathunga. Automating research synthesis with domain-specific large language model fine-tuning. ACM Trans. Knowl. Discov. Data, 19(3), March 2025. ISSN 1556-4681. doi: 10.1145/3715964. URL https://doi.org/10.1145/3715964.
[3] Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324, 2023.
[4] Lingbo Li, Anuradha Mathrani, and Teo Susnjak. What level of automation is "good enough"? A benchmark of large language models for meta-analysis data extraction, 2025. URL https://arxiv.org/abs/2507.15152.
[5] Michael Kearns. Responsible AI in the generative era, May 2023. URL https://www.amazon.science/blog/responsible-ai-in-the-generative-era. Accessed: 2025-08-01.
[6] Amirhossein Razavi, Mina Soltangheis, Negar Arabzadeh, Sara Salamat, Morteza Zihayat, and Ebrahim Bagheri. Benchmarking prompt sensitivity in large language models. In European Conference on Information Retrieval, pages 303–313. Springer, 2025.
[7] Moritz Staudinger, Wojciech Kusa, Florina Piroi, Aldo Lipani, and Allan Hanbury. A reproducibility and generalizability study of large language models for query generation. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 186–196, 2024.
[8] Aaditya Shah, Shridhar Mehendale, and Siddha Kanthi. Efficacy of large language models for systematic reviews. In 2024 2nd International Conference on Foundation and Large Language Models (FLLM), pages 29–35. IEEE, 2024.