Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
Xinran Zhang

TL;DR
This study compares atomic-decomposition and holistic prompting for LLM-based, reference-grounded question answering (QA) evaluation. It finds that holistic prompts often perform as well as or better than atomic prompts, especially at detecting partially supported answers.
Contribution
It provides a systematic comparison between self-decomposing atomic prompts and holistic prompts for reference-grounded QA evaluation across multiple benchmarks and models.
Findings
Holistic judges match or outperform atomic judges on two benchmarks.
Holistic advantage is mainly in detecting partially supported answers.
Reference quality degradation significantly impacts accuracy for both methods.
Abstract
Atomic decomposition -- breaking a candidate answer into claims before verifying each against a reference -- is a widely adopted design for LLM-based reference-grounded judges. However, atomic prompts are typically richer and longer, making it unclear whether any advantage comes from decomposition or from richer prompting. We study this for benchmark-style completeness-sensitive reference-support classification: classifying a candidate as fully supported, partially supported, or unsupported relative to a supplied reference. We compare a self-decomposing atomic judge (single-prompt decompose-and-verify) against a prompt-controlled holistic judge with the same inputs and a similarly detailed rubric. On 200 source examples per dataset across TruthfulQA, ASQA, and QAMPARI, with four model families, source-level paired tests, cluster bootstrap, and aggregation across three pre-frozen prompt…
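The abstract mentions source-level paired tests and a cluster bootstrap but the visible text does not spell out the procedure. As a rough illustration only, the sketch below shows one generic way to bootstrap the accuracy difference between two judges while resampling whole source examples (clusters) rather than individual judgments. The function name `cluster_bootstrap_diff`, the record layout, and the 95% percentile interval are assumptions made for this example, not details taken from the paper.

```python
import random
from collections import defaultdict


def cluster_bootstrap_diff(records, n_boot=2000, seed=0):
    """Percentile cluster bootstrap for a paired accuracy difference.

    records: iterable of (source_id, holistic_correct, atomic_correct),
             where the two flags are 0 or 1 (hypothetical layout).
    Returns (point_estimate, ci_low, ci_high) for holistic minus atomic.
    """
    # Group paired differences by source example so that all judgments
    # derived from one source are resampled together.
    by_source = defaultdict(list)
    for source_id, holistic_ok, atomic_ok in records:
        by_source[source_id].append(holistic_ok - atomic_ok)
    clusters = list(by_source.values())

    def mean_diff(sample):
        total = sum(sum(cluster) for cluster in sample)
        count = sum(len(cluster) for cluster in sample)
        return total / count

    point = mean_diff(clusters)

    # Resample clusters with replacement and recompute the statistic.
    rng = random.Random(seed)
    boots = sorted(
        mean_diff([rng.choice(clusters) for _ in clusters])
        for _ in range(n_boot)
    )
    ci_low = boots[int(0.025 * n_boot)]
    ci_high = boots[int(0.975 * n_boot) - 1]
    return point, ci_low, ci_high
```

Resampling at the source level keeps judgments derived from the same source example together, so the interval reflects variation across sources rather than treating correlated judgments as independent.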
