RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Yunseok Han; Yejoon Lee; Jaeyoung Do

arXiv:2602.17053·cs.AI·February 24, 2026


TL;DR

This paper introduces RFEval, a benchmark for assessing reasoning faithfulness in large reasoning models through counterfactual interventions, revealing significant unfaithfulness especially in math and code tasks, and highlighting the weak correlation between accuracy and faithfulness.

Contribution

The paper proposes a formal framework for reasoning faithfulness, develops the RFEval benchmark of 7,186 instances across seven tasks, and provides empirical insights into the factors that affect faithfulness in large reasoning models.

Findings

01

49.7% of model outputs are unfaithful

02

Faithfulness failures are concentrated in math and code domains

03

Adding RL-style objectives can improve faithfulness without affecting accuracy

Abstract

Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with…
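The two conditions named in the abstract can be illustrated with a minimal Python sketch. Every name here (`Output`, `stance_consistent`, `causally_influenced`, `is_faithful`, `unfaithful_rate`, and the string stance labels) is hypothetical and is not taken from the paper or the RFEval dataset. In particular, treating "the answer changed after the intervention" as evidence of causal influence is a simplifying assumption for illustration, not the benchmark's actual criterion.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Output:
    """One model output, reduced to two hypothetical stance labels."""
    reasoning_stance: str  # the option the stated reasoning argues for
    answer: str            # the option the model finally commits to


def stance_consistent(out: Output) -> bool:
    # Assumed proxy: the reasoning and the final answer support the same option.
    return out.reasoning_stance == out.answer


def causally_influenced(baseline: Output, intervened: Output) -> bool:
    # Assumed proxy: injecting counterfactual reasoning changes the final answer.
    return intervened.answer != baseline.answer


def is_faithful(baseline: Output, intervened: Output) -> bool:
    # Both conditions must hold; correctness of the answer is never consulted,
    # mirroring the abstract's point that faithfulness is decoupled from accuracy.
    return stance_consistent(intervened) and causally_influenced(baseline, intervened)


def unfaithful_rate(pairs: Iterable[Tuple[Output, Output]]) -> float:
    # Fraction of (baseline, intervened) pairs that fail either condition.
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (baseline, intervened) pair")
    failures = sum(1 for base, interv in pairs if not is_faithful(base, interv))
    return failures / len(pairs)
```

Under these assumptions, a model whose reasoning adopts the injected stance but whose answer stays on the original option counts as unfaithful through stance inconsistency, the failure type the abstract reports as predominant.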

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Tackles a timely question of reasoning output quality and consistency.
- Clear, testable behavioral definition of faithfulness separate from accuracy; simple, interpretable metrics.
- Broad multi-task coverage and 12-model comparison; informative diagnostics by transition location and causality type.

Weaknesses

A. Granularity gap: Despite formal step-wise notation, the implementation evaluates coarse components (r/e/a), not per-step CoT causality; this undermines a key motivation/contribution.
B. Right-censoring: Heavy reliance on contrast-conditional filtering (δ=1) and exclusion of truncated or malformed outputs creates informative censoring; cross-model comparability is not fully addressed.
C. Evaluator bias/confounding: A single judge with low recall on flaw identification underpins major conclu

Reviewer 02 · Rating 4 · Confidence 2

Strengths

a. The paper provides a clear, operational definition of reasoning faithfulness grounded in causal influence and logical coherence.
b. RFEval is carefully built with human-reviewed, subtly flawed counterfactual reasoning across diverse domains (math, code, law, etc.), enabling fine-grained diagnostics.
c. This study performs large-scale evaluation of 12 open-source LRMs across 7 tasks.

Weaknesses

a. I personally find Section 2 hard to follow. Maybe the authors can add some examples to explain the idea.
b. The answer to Q3 in Section 5 is not convincing, as the models use different architectures and are trained on different data. Training method may not be the only factor influencing reasoning faithfulness.
c. The reasoning traces used to evaluate faithfulness differ between models since they are filtered beforehand. The comparison is potentially not fair since they are evaluat

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper addresses reasoning faithfulness, an important yet underexplored topic. The proposed approach, probing faithfulness by intervening on the reasoning traces, is original.
2. The authors conduct extensive experiments and analyze the results from multiple perspectives, providing a comprehensive evaluation.

Weaknesses

1. Overall, I am not convinced that interventions on reasoning traces truly measure model faithfulness. As the authors note, “a faithful explanation should reflect a model’s internal reasoning process.” However, modifying the output reasoning trace is not equivalent to altering the internal reasoning process and may instead confuse the model. If the model does not believe the inserted reasoning, it may fail to continue coherently, but this does not necessarily mean it would be unfaithful when re

Code & Models

Datasets

snu-aidas/RFEval
dataset · 36 dl

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks · Topic Modeling