# Consistency–accuracy correlation in hard-prompted LLMs for entity and relation extraction: empirical findings from plant-health data

**Authors:** Xinzhi Yao, Claire Nédellec, Jingbo Xia, Robert Bossy

PMC · DOI: 10.1186/s44342-025-00063-2 · Genomics & Informatics · 2026-02-10

## TL;DR

This paper studies how accuracy and run-to-run consistency relate when large language models extract entities and relations from plant-health data. It finds that the two are positively correlated in some cases, but that the correlation depends on the model and on task complexity.

## Contribution

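The study refines how consistency is interpreted in LLM-based information extraction by distinguishing recoverable output variations, which preserve the meaning of the extracted information, from critical ones, which result in semantic errors.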

## Key findings

- Some positive correlation between accuracy and consistency exists, but it is model-dependent and varies with task complexity.
- In structured prediction tasks, consistency should be measured at the semantic level, ignoring superficial variations in format or wording.
- Recoverable variations can be filtered without losing meaningful information.
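
The difference between surface-level and semantic-level consistency can be illustrated with a minimal sketch. It is not the paper's metric: the normalization rule (trim, collapse whitespace, lowercase), the Jaccard-based pairwise score, the function names, and the example triples are all assumptions made for illustration.

```python
from itertools import combinations


def normalize(triple):
    """Canonicalize a (head, relation, tail) triple: trim, collapse whitespace, lowercase.

    Hypothetical stand-in for whatever normalization makes recoverable
    variations comparable; the paper's own procedure may differ.
    """
    return tuple(" ".join(part.split()).lower() for part in triple)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_consistency(runs, semantic=True):
    """Average Jaccard similarity over all pairs of runs.

    With semantic=False the triples are compared as raw strings, so any
    difference in case or spacing counts as a disagreement.
    """
    sets = [{normalize(t) if semantic else tuple(t) for t in run} for run in runs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented example outputs from three repeated runs on the same input.
runs = [
    [("Xylella fastidiosa", "affects", "olive")],
    [("xylella fastidiosa ", "Affects", "Olive")],     # recoverable: case and spacing only
    [("Xylella fastidiosa", "affects", "grapevine")],  # critical: a different argument
]

surface_score = mean_pairwise_consistency(runs, semantic=False)
semantic_score = mean_pairwise_consistency(runs, semantic=True)
```

In this toy case the surface score treats all three runs as mutually inconsistent, while the semantic score recognizes the first two as equivalent and only penalizes the run with the different argument.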

## Abstract

As large language models (LLMs) become increasingly popular for information extraction (IE), concerns persist regarding the stability and reliability of their outputs. While accuracy has traditionally been the main evaluation metric, consistency—defined as the stability of model outputs across repeated runs—has recently been proposed as a complementary signal of reliability. In this work, we examine the relationship between accuracy and consistency in hard-prompted generative LLMs applied to entity and relation extraction. We conduct a systematic evaluation using four LLMs (GPT, DeepSeek, Qwen, Kimi) on the EPOP corpus, a plant-health dataset with rich entity types, long-range relations, overlapping relations, and strong argument constraints. To refine the interpretation of consistency, we distinguish between recoverable output variations—those that preserve the meaning of the extracted information—and critical ones that result in semantic errors. Our results show that while some positive correlation between accuracy and consistency exists, it is model-dependent and varies with task complexity. In structured prediction tasks, we show that consistency should be measured at the semantic level, ignoring superficial variations in format or wording. These insights have important implications for using self-consistency as a confidence filter and for designing reliable generative IE pipelines in specialized domains.

The online version contains supplementary material available at 10.1186/s44342-025-00063-2.
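
The idea of using self-consistency as a confidence filter, mentioned at the end of the abstract, can be sketched as follows. This is a hypothetical illustration, not the authors' pipeline: the function name, the `min_support` threshold, and the example triples are assumptions, and the triples are assumed to be already normalized to a canonical form.

```python
from collections import Counter


def self_consistency_filter(runs, min_support=0.6):
    """Keep triples that appear in at least `min_support` of the runs.

    Returns a dict mapping each kept triple to its support, i.e. the
    fraction of runs in which it was extracted.
    """
    n_runs = len(runs)
    if n_runs == 0:
        return {}
    counts = Counter()
    for run in runs:
        counts.update(set(run))  # count each triple at most once per run
    return {
        triple: count / n_runs
        for triple, count in counts.items()
        if count / n_runs >= min_support
    }


# Invented, already-normalized outputs from four repeated runs.
stable = ("xylella fastidiosa", "affects", "olive")
unstable = ("xylella fastidiosa", "affects", "grapevine")
example_runs = [
    [stable],
    [stable, unstable],
    [stable],
    [stable],
]

kept = self_consistency_filter(example_runs, min_support=0.6)
```

Because the reported correlation between consistency and accuracy is model-dependent and varies with task complexity, a threshold like this would need to be validated per model and per task rather than assumed to track correctness.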

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12888769/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12888769/full.md

## References

10 references — full list in the complete paper: https://tomesphere.com/paper/PMC12888769/full.md

---
Source: https://tomesphere.com/paper/PMC12888769