Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis
Zhiyin Tan, Jennifer D'Souza

TL;DR
This paper evaluates the ability of large language models to perform structured evidence extraction for meta-analyses, revealing significant limitations in relational and numerical accuracy that hinder reliable automation.
Contribution
It introduces a diagnostic framework and evaluation protocol to systematically assess LLMs' structural fidelity in evidence extraction for meta-analysis, highlighting key failure modes.
Findings
Performance drops sharply with complex relational tasks
Long-context inputs worsen extraction reliability
Systematic structural errors hinder accurate meta-analytic data extraction
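The gap between atom-level and relational extraction can be shown with a small Python sketch. This is not the paper's evaluation protocol: the `EffectRecord` schema, the two scoring functions, and the sample values are all hypothetical, chosen only to show how an extractor can recover every individual value while attributing effect sizes to the wrong outcomes.

```python
from typing import NamedTuple


class EffectRecord(NamedTuple):
    """Hypothetical study record; the field names are illustrative assumptions."""
    study: str
    outcome: str
    method: str
    effect_size: float


def atom_recall(gold, pred):
    """Share of gold (field, value) pairs that appear anywhere in the predictions."""
    gold_atoms = {(f, v) for r in gold for f, v in r._asdict().items()}
    pred_atoms = {(f, v) for r in pred for f, v in r._asdict().items()}
    return len(gold_atoms & pred_atoms) / len(gold_atoms)


def record_recall(gold, pred):
    """Share of gold records reproduced exactly, with every field bound together."""
    return len(set(gold) & set(pred)) / len(set(gold))


# Made-up example data: the prediction swaps the two effect sizes.
gold = [
    EffectRecord("study_A", "recall", "ANOVA", 0.42),
    EffectRecord("study_A", "accuracy", "t-test", 0.10),
]
pred = [
    EffectRecord("study_A", "recall", "ANOVA", 0.10),
    EffectRecord("study_A", "accuracy", "t-test", 0.42),
]

print(atom_recall(gold, pred))    # every isolated value was found
print(record_recall(gold, pred))  # but no record has the right attribution
```

In this toy case the atom-level score is perfect while the record-level score is zero, which is the kind of structural error an entity-only metric would miss.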
Abstract
Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect-size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom-level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state-of-the-art LLMs under both per-document and long-context, multi-document…
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Biomedical Text Mining and Ontologies
