Beyond Factual Accuracy: Evaluating Global Reasoning Integrity in RAG Systems with LogicScore
Zhichao Yan, Yunxiao Zhao, Jiapu Wang, Jiaoyan Chen, Xiaoli Li, Ru Li, Jeff Z. Pan

TL;DR
This paper introduces LogicScore, a global reasoning evaluation framework for RAG systems that assesses logical integrity beyond factual accuracy, revealing significant reasoning gaps in current models.
Contribution
We propose LogicScore, a Horn Rule-based global reasoning evaluation method that systematically measures completeness, essentiality, and determinateness in long-form answer generation.
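To make the idea of Horn Rule-based backward verification concrete, here is a minimal toy sketch; it is not the authors' implementation. It assumes premises are plain string atoms, that each rule is a (body, head) pair, that completeness means the answer atom is derivable from the stated premises, and that essentiality is the fraction of premises whose removal breaks that derivation. The function names, the scoring formula, and the example atoms are all hypothetical, and determinateness is not modeled.

```python
from typing import FrozenSet, List, Tuple

# A Horn rule: if every atom in the body holds, the head holds.
Rule = Tuple[FrozenSet[str], str]


def derivable(goal: str, facts: FrozenSet[str], rules: List[Rule],
              _path: FrozenSet[str] = frozenset()) -> bool:
    """Backward chaining: start from the goal and work back to stated facts."""
    if goal in facts:
        return True
    if goal in _path:  # goal already on the current proof path: avoid cycles
        return False
    path = _path | {goal}
    return any(
        head == goal and all(derivable(b, facts, rules, path) for b in body)
        for body, head in rules
    )


def essentiality(goal: str, facts: FrozenSet[str], rules: List[Rule]) -> float:
    """Hypothetical score: share of facts whose removal breaks the derivation."""
    if not facts or not derivable(goal, facts, rules):
        return 0.0
    needed = [f for f in facts if not derivable(goal, facts - {f}, rules)]
    return len(needed) / len(facts)


# Toy example (invented atoms): two premises are needed, one is redundant.
facts = frozenset({"born_in(A, Paris)", "capital_of(Paris, France)", "plays(A, piano)"})
rules = [(
    frozenset({"born_in(A, Paris)", "capital_of(Paris, France)"}),
    "born_in_capital_of(A, France)",
)]
goal = "born_in_capital_of(A, France)"

complete = derivable(goal, facts, rules)
score = essentiality(goal, facts, rules)
```

In this toy case the answer is derivable, yet one of the three premises contributes nothing to the derivation, which is the kind of factually correct but redundant content an essentiality measure is meant to expose.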
Findings
Leading models excel in factual accuracy but lag in reasoning quality.
Gemini-3 Pro achieves 92.85% factual precision but scores only 35.11% on essentiality.
Our evaluation highlights the critical need for reasoning-focused metrics in LLM development.
Abstract
Current evaluation methods for Retrieval Augmented Generation (RAG) suffer from "factual myopia": they relentlessly emphasize factual accuracy yet neglect global logical integrity in long-form answer generation. This drives models to force unnatural connections, producing factually grounded yet logically incoherent responses with unaddressed gaps, ambiguous links, or redundant premises. To mitigate this, we present LogicScore, shifting from local, fact-by-fact assessment to rigorous global reasoning scrutiny. Grounded in Horn Rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: Completeness (logically sound deduction), Essentiality (non-redundancy), and Determinateness (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA,…
