C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
Avni Mittal, Rauno Arike

TL;DR
This paper introduces C2-Faith, a benchmark that evaluates LLMs as judges of chain-of-thought reasoning along two dimensions, causality and coverage, and shows where their assessments of process faithfulness are reliable and where they fail.
Contribution
The paper presents C2-Faith, a novel benchmark that uses controlled perturbations to evaluate LLM judges on causality and coverage, and it highlights their task-dependent reliability and failure modes.
Findings
Model rankings vary with task framing.
No single judge excels across all settings.
Coverage judgments tend to be inflated for incomplete reasoning.
Abstract
Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no…
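The two perturbation types described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the uniform random choice of position, and the rounding of the deletion count are all assumptions made for illustration, and the acausal replacement text is supplied by the caller because the excerpt does not say how such variants are produced.

```python
import random
from typing import List, Tuple


def replace_with_acausal(steps: List[str], acausal_step: str,
                         rng: random.Random) -> Tuple[List[str], int]:
    """Replace one step with an acausal variant.

    Returns the perturbed chain and the 0-based position of the injected
    error, which serves as the known label for detection and localization.
    Choosing the position uniformly at random is an assumption.
    """
    if not steps:
        raise ValueError("need at least one reasoning step")
    pos = rng.randrange(len(steps))
    perturbed = list(steps)
    perturbed[pos] = acausal_step
    return perturbed, pos


def delete_steps(steps: List[str], rate: float,
                 rng: random.Random) -> Tuple[List[str], List[int]]:
    """Delete a fraction `rate` of the steps for the coverage task.

    Returns the remaining steps (original order preserved) and the sorted
    indices that were removed. Rounding the count with round() is an assumption.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    n_delete = round(rate * len(steps))
    deleted = sorted(rng.sample(range(len(steps)), n_delete))
    dropped = set(deleted)
    kept = [s for i, s in enumerate(steps) if i not in dropped]
    return kept, deleted


# Hypothetical four-step chain, used only to exercise the functions.
chain = ["Let x = 3.", "Then 2x = 6.", "Add 1 to get 7.", "So the answer is 7."]
rng = random.Random(0)
perturbed, error_pos = replace_with_acausal(chain, "Therefore x is prime.", rng)
kept, deleted = delete_steps(chain, rate=0.5, rng=rng)
```

Because each perturbed example carries its own label (`error_pos` or `deleted`), a judge's output can be scored directly against it without additional human annotation.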
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Bayesian Modeling and Causal Inference
