Employing General-Purpose and Biomedical Large Language Models with Advanced Prompt Engineering for Pharmacoepidemiologic Study Design

Xinyao Zhang; Nicole Sonne Heckmann; Manuela Del Castillo Suero; Francesco Paolo Speca; Maurizio Sessa

arXiv:2604.17988 · cs.CL · April 21, 2026

TL;DR

This study compares general-purpose and biomedical large language models in supporting pharmacoepidemiologic study design, finding that general-purpose models with advanced prompting outperform biomedical models in relevance and reasoning.

Contribution

The paper provides a systematic evaluation of LLMs for pharmacoepidemiologic study design, highlighting the effectiveness of prompt engineering and the superior performance of general-purpose models.

Findings

01

GPT-4o with LTM prompting (GPT-4o-LTM) achieved a median relevance score of 4 on most questions.

02

General-purpose LLMs outperformed biomedical LLMs in relevance and justification.

03

Prompt strategies significantly affected LLM performance.

Abstract

Background: The potential of large language models (LLMs) to automate and support pharmacoepidemiologic study design is an emerging area of interest, yet their reliability remains insufficiently characterized. General-purpose LLMs often display inaccuracies, while the comparative performance of specialized biomedical LLMs in this domain remains unknown. Methods: This study evaluated general-purpose LLMs (GPT-4o and DeepSeek-R1) versus biomedically fine-tuned LLMs (QuantFactory/Bio-Medical-Llama-3-8B-GGUF and Irathernotsay/qwen2-1.5B-medical_qa-Finetune) using 46 protocols (2018-2024) from the HMA-EMA Catalogue and Sentinel System. Performance was assessed across relevance, logic of justification, and ontology-code agreement across multiple coding systems using Least-to-Most (LTM) and Active Prompting strategies. Results: GPT-4o and DeepSeek-R1 paired with LTM prompting achieved the…
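
For readers unfamiliar with Least-to-Most (LTM) prompting, the general pattern can be sketched as a two-stage loop: the model first decomposes a task into simpler subquestions, then answers them in order with earlier answers fed back into each later prompt. This is a minimal illustration of the generic technique, not the authors' prompts or pipeline, which this listing does not show. The names `least_to_most`, `ask`, and `stub_model`, the prompt wording, and the example task are all hypothetical.

```python
from typing import Callable, List, Tuple


def least_to_most(question: str, ask: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Run a two-stage Least-to-Most loop; return (subquestion, answer) pairs."""
    # Stage 1: ask the model to decompose the task, one subquestion per line.
    decomposition = ask(
        "Break the following task into simpler subquestions, one per line, "
        "ordered from easiest to hardest.\n\nTask: " + question
    )
    subquestions = [ln.strip() for ln in decomposition.splitlines() if ln.strip()]

    # Stage 2: answer each subquestion in order, carrying earlier answers forward.
    solved: List[Tuple[str, str]] = []
    for sub in subquestions:
        context = "\n".join(f"Q: {q}\nA: {a}" for q, a in solved)
        prompt = f"Task: {question}\n\n{context}\n\nQ: {sub}\nA:"
        solved.append((sub, ask(prompt).strip()))
    return solved


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call so the sketch runs offline."""
    if prompt.startswith("Break the following task"):
        return "What is the exposure?\nWhat is the outcome?"
    # Echo how many answer slots the prompt contains, to show context growth.
    return f"answer {prompt.count('A:')}"


if __name__ == "__main__":
    task = "Outline a cohort design for a hypothetical drug-outcome pair."
    for sub, ans in least_to_most(task, stub_model):
        print(sub, "->", ans)
```

In practice `ask` would wrap a call to whichever model is being evaluated; the stub only demonstrates that each later prompt includes the answers produced so far.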
