ReProbe: Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models

Jingwei Ni; Ekaterina Fadeeva; Tianyi Wu; Mubashara Akhtar; Jiaheng Zhang; Elliott Ash; Markus Leippold; Timothy Baldwin; See-Kiong Ng; Artem Shelmanov; Mrinmaya Sachan

arXiv:2511.06209 · cs.AI · April 24, 2026

TL;DR

This paper introduces ReProbe, a lightweight probe that reads the internal states of a frozen LLM to verify individual reasoning steps during test-time scaling, reducing verification cost while matching or improving performance across multiple domains.

Contribution

ReProbe offers an efficient approach to step-level verification that leverages the LLM's internal states, matching or outperforming larger, more costly verification models such as Process Reward Models (PRMs).

Findings

01

Probes match or outperform larger PRMs in multiple domains.

02

Probes are lightweight, with fewer than 10 million parameters.

03

Internal states encode confidence signals useful for reasoning verification.

Abstract

LLMs can solve complex tasks by generating long, multi-step reasoning chains. Test-time scaling (TTS) can further improve performance by sampling multiple variants of intermediate reasoning steps, verifying their correctness, and selecting the best steps for continuation. However, existing verification approaches, such as Process Reward Models (PRMs), are computationally expensive and require large-scale human or model-generated annotations. We propose a lightweight alternative for step-level reasoning verification based on probing the internal states of LLMs. We train a transformer-based probe that uses the internal states of a frozen LLM to estimate the credibility of its reasoning steps during generation. Annotation can be provided either by a larger LLM (e.g., DeepSeek-R1) or in a self-supervised manner by the original model itself. The probes are lightweight, containing fewer than…
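
The step-level test-time scaling loop described in the abstract (sample several candidate steps, verify each, continue from the best) can be illustrated with a minimal sketch. This is not the authors' implementation: the functions `propose_steps` and `score_step`, the `stop_marker` convention, and the toy stand-ins below are all hypothetical. In ReProbe the score would come from a probe over the frozen LLM's internal states; here it is replaced by a hand-written scoring rule so the example runs on its own.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces, assumed for illustration only:
#   propose_steps(prefix, n) -> up to n candidate next steps given the chain so far
#   score_step(prefix, step) -> estimated credibility of `step` (higher is better)
ProposeFn = Callable[[Sequence[str], int], List[str]]
ScoreFn = Callable[[Sequence[str], str], float]


def step_level_search(
    propose_steps: ProposeFn,
    score_step: ScoreFn,
    n_candidates: int = 4,
    max_steps: int = 8,
    stop_marker: str = "ANSWER:",
) -> List[str]:
    """Greedy step-level selection: at each step keep the best-scored candidate."""
    chain: List[str] = []
    for _ in range(max_steps):
        candidates = propose_steps(chain, n_candidates)
        if not candidates:
            break
        # Verify every candidate and continue from the most credible one.
        best = max(candidates, key=lambda step: score_step(chain, step))
        chain.append(best)
        if best.startswith(stop_marker):
            break
    return chain


def toy_propose(prefix: Sequence[str], n: int) -> List[str]:
    """Toy generator: two reasoning steps, then candidate final answers."""
    k = len(prefix)
    if k >= 2:
        return ["ANSWER: 4", "ANSWER: 5"][:n]
    return [f"step {k} variant {i}" for i in range(n)]


def toy_score(prefix: Sequence[str], step: str) -> float:
    """Toy verifier standing in for a probe; the rule itself is arbitrary."""
    if step.endswith("variant 1") or step == "ANSWER: 4":
        return 0.9
    return 0.1


chain = step_level_search(toy_propose, toy_score, n_candidates=3)
print(chain)
```

Because the verifier is called once per candidate at every step, its cost is multiplied by the number of candidates and steps, which is why a lightweight scorer matters in this setting.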