RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Yunseok Han, Yejoon Lee, Jaeyoung Do

TL;DR
This paper introduces RFEval, a benchmark for assessing reasoning faithfulness in large reasoning models through counterfactual interventions, revealing significant unfaithfulness especially in math and code tasks, and highlighting the weak correlation between accuracy and faithfulness.
Contribution
The paper proposes a formal framework for reasoning faithfulness, develops the RFEval benchmark with over 7,000 instances, and provides empirical insights into the factors that affect faithfulness in large reasoning models.
Findings
49.7% of model outputs are unfaithful
Faithfulness failures are concentrated in math and code domains
Adding RL-style objectives on top of supervised fine-tuning can reduce faithfulness even when accuracy is maintained
Abstract
Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with…
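The two conditions named in the abstract can be illustrated with a minimal sketch. This is a deliberately simplified reading, not the paper's implementation: the `Output` schema, the stance labels, and all function names are hypothetical, and stance consistency is reduced here to "every component supports the same stance".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    """Stance labels for one model output (hypothetical schema).

    Each field holds the stance a judge assigned to that component,
    for example the answer option the component supports.
    """
    reasoning: str
    explanation: str
    answer: str


def stance_consistent(out: Output) -> bool:
    # Simplifying assumption: all components must support the same stance.
    return out.reasoning == out.explanation == out.answer


def causally_influenced(baseline: Output, intervened: Output) -> bool:
    # Simplifying assumption: the intervention counts as influential if the
    # reasoning stance or the final answer differs from the baseline run.
    return (baseline.reasoning != intervened.reasoning
            or baseline.answer != intervened.answer)


def is_faithful(baseline: Output, intervened: Output) -> bool:
    # Both conditions must hold; answer correctness is never consulted,
    # which keeps the check decoupled from accuracy.
    return (stance_consistent(intervened)
            and causally_influenced(baseline, intervened))


def unfaithful_rate(pairs) -> float:
    # Fraction of (baseline, intervened) pairs judged unfaithful.
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (baseline, intervened) pair")
    return sum(not is_faithful(b, i) for b, i in pairs) / len(pairs)
```

Under these assumptions, an output whose reasoning follows the injected counterfactual while the final answer stays unchanged is flagged as unfaithful because its stances disagree, regardless of whether that answer is correct.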
Peer Reviews
Decision: ICLR 2026 Poster
- Tackles a timely question of reasoning output quality and consistency.
- Clear, testable behavioral definition of faithfulness separate from accuracy; simple, interpretable metrics.
- Broad multi-task coverage and 12-model comparison; informative diagnostics by transition location and causality type.
A. Granularity gap: Despite the formal step-wise notation, the implementation evaluates coarse components (r/e/a), not per-step CoT causality; this undermines a key motivation and contribution.
B. Right-censoring: Heavy reliance on contrast-conditional filtering (δ=1) and exclusion of truncated or malformed outputs creates informative censoring; cross-model comparability is not fully addressed.
C. Evaluator bias/confounding: A single judge with low recall on flaw identification underpins major conclu
a. The paper provides a clear, operational definition of reasoning faithfulness grounded in causal influence and logical coherence.
b. RFEval is carefully built with human-reviewed, subtly flawed counterfactual reasoning across diverse domains (math, code, law, etc.), enabling fine-grained diagnostics.
c. This study performs large-scale evaluation of 12 open-source LRMs across 7 tasks.
a. I personally find Section 2 hard to follow. The authors could add some examples to explain the idea.
b. The answer to Q3 in Section 5 is not convincing, as the models use different architectures and are trained on different data. The training method may not be the only factor influencing reasoning faithfulness.
c. The reasoning traces used to evaluate faithfulness differ between models because they are filtered beforehand. The comparison is potentially not fair since they are evaluat
1. The paper addresses reasoning faithfulness, an important yet underexplored topic. The proposed approach, probing faithfulness by intervening on the reasoning traces, is original.
2. The authors conduct extensive experiments and analyze the results from multiple perspectives, providing a comprehensive evaluation.
1. Overall, I am not convinced that interventions on reasoning traces truly measure model faithfulness. As the authors note, “a faithful explanation should reflect a model’s internal reasoning process.” However, modifying the output reasoning trace is not equivalent to altering the internal reasoning process and may instead confuse the model. If the model does not believe the inserted reasoning, it may fail to continue coherently, but this does not necessarily mean it would be unfaithful when re
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks · Topic Modeling
