Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Richard J. Young

arXiv:2603.20172 · cs.CL · March 25, 2026

TL;DR

This paper demonstrates that the measurement of faithfulness in language models' reasoning traces varies significantly depending on the classifier used, challenging the notion of faithfulness as an objective property.

Contribution

It provides empirical evidence that faithfulness metrics are classifier-dependent and systematically inconsistent, highlighting the need for multiple evaluation methods.

Findings

01

Classifier-based faithfulness rates vary widely across models.

02

Different classifiers disagree significantly and systematically.

03

Classifier choice can reverse model rankings in faithfulness assessments.

Abstract

Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper provides evidence that it is not. Three classifiers (a regex-only detector, a regex-plus-LLM pipeline, and a Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce faithfulness rates of 74.4%, 82.6%, and 69.7%. Per-model gaps range from 2.6 to 30.6 percentage points; all pairwise McNemar tests are significant (p < 0.001). The disagreements are systematic: Cohen's kappa ranges from 0.06 ("slight") for sycophancy hints to 0.42 ("moderate") for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as…
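The agreement statistics named in the abstract can be illustrated with a minimal sketch. It assumes each classifier emits one binary label per trace (1 = faithful, 0 = unfaithful) and uses an exact two-sided McNemar test on the discordant pairs; the function names and the toy labels are hypothetical and do not come from the paper or its dataset, and the paper may use a different McNemar variant.

```python
from math import comb


def agreement_table(a, b):
    """Count (both 1, only a is 1, only b is 1, both 0) for two binary label lists."""
    if len(a) != len(b):
        raise ValueError("label lists must have the same length")
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    return n11, n10, n01, n00


def faithfulness_rate(labels):
    """Fraction of traces a classifier marks as faithful."""
    return sum(labels) / len(labels)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n11, n10, n01, n00 = agreement_table(a, b)
    n = n11 + n10 + n01 + n00
    p_observed = (n11 + n00) / n
    p_a = (n11 + n10) / n  # share of 1s from classifier a
    p_b = (n11 + n01) / n  # share of 1s from classifier b
    p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_chance == 1:
        return 1.0  # both classifiers give the same constant label
    return (p_observed - p_chance) / (1 - p_chance)


def mcnemar_exact_p(a, b):
    """Exact two-sided McNemar p-value, computed from the discordant pairs only."""
    _, n10, n01, _ = agreement_table(a, b)
    n = n10 + n01
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(n10, n01)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy labels (hypothetical, not taken from the paper's data).
regex_labels = [1, 1, 1, 1, 0, 0, 0, 0]
judge_labels = [1, 1, 0, 0, 0, 0, 0, 0]

print(faithfulness_rate(regex_labels), faithfulness_rate(judge_labels))
print(cohens_kappa(regex_labels, judge_labels))
print(mcnemar_exact_p(regex_labels, judge_labels))
```

Kappa measures how much two classifiers agree beyond chance, while McNemar's test asks whether their disagreements lean in one direction, which is why the abstract reports both: two classifiers can show nonzero kappa and still differ systematically in their aggregate rates.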

Code & Models

Datasets

richardyoung/cot-faithfulness-open-models
dataset · 450 dl

Taxonomy

Topics: Embodied and Extended Cognition · Mental Health Research Topics · Child and Animal Learning Development