When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents
Laksh Advani

TL;DR
This paper exposes a reliability crisis in small language models, where many correct answers are based on flawed reasoning, and introduces a process verification metric to improve trustworthiness.
Contribution
It introduces the Reasoning Integrity Score (RIS), a process-based metric validated across diverse tasks, and analyzes the effects of retrieval augmentation and meta-cognition on reasoning quality.
Findings
50-69% of correct answers contain flawed reasoning
Retrieval-augmented generation improves reasoning integrity significantly
Meta-cognitive interventions can harm performance in small models
Abstract
Deploying small language models (7-9B parameters) as autonomous agents requires trust in their reasoning, not just their outputs. We reveal a critical reliability crisis: 50-69% of correct answers from these models contain fundamentally flawed reasoning, a "Right-for-Wrong-Reasons" phenomenon invisible to standard accuracy metrics. Through analysis of 10,734 reasoning traces across three models and diverse tasks, we introduce the Reasoning Integrity Score (RIS), a process-based metric validated with substantial inter-rater agreement (). Our findings challenge conventional practices: while retrieval-augmented generation (RAG) significantly improves reasoning integrity (Cohen's --), meta-cognitive interventions like self-critique often harm performance ( to ) in small models on the evaluated tasks. Mechanistic analysis reveals RAG…
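To make the "Right-for-Wrong-Reasons" figure concrete, here is a minimal sketch of how such a rate could be computed from labeled reasoning traces. It is not the paper's RIS implementation: the `Trace` type, its field names, and the binary sound/flawed label are hypothetical simplifications introduced only for illustration. The sketch shows why accuracy alone cannot reveal the problem: the rate is computed over correct answers only.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One labeled reasoning trace (hypothetical schema, not from the paper)."""
    answer_correct: bool   # did the final answer match the reference?
    reasoning_sound: bool  # did a rater judge the reasoning steps to be valid?


def right_for_wrong_reasons_rate(traces):
    """Share of correct answers whose reasoning was judged flawed."""
    correct = [t for t in traces if t.answer_correct]
    if not correct:
        raise ValueError("no correct answers to evaluate")
    flawed = sum(1 for t in correct if not t.reasoning_sound)
    return flawed / len(correct)


# Toy data, invented for illustration only.
traces = [
    Trace(answer_correct=True, reasoning_sound=True),
    Trace(answer_correct=True, reasoning_sound=False),
    Trace(answer_correct=True, reasoning_sound=False),
    Trace(answer_correct=False, reasoning_sound=False),
    Trace(answer_correct=True, reasoning_sound=True),
]

accuracy = sum(t.answer_correct for t in traces) / len(traces)
rate = right_for_wrong_reasons_rate(traces)
print(f"accuracy={accuracy:.2f}, right-for-wrong-reasons rate={rate:.2f}")
```

In this toy set, four of five answers are correct, yet half of those correct answers rest on reasoning labeled as flawed, which an accuracy-only evaluation would not surface.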
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Ethics and Social Impacts of AI
