TL;DR
PiCSAR is a training-free method that improves reasoning accuracy by scoring candidate solutions using joint log-likelihood, effectively identifying correct reasoning chains without ground-truth answers.
Contribution
It introduces PiCSAR, a novel scoring approach based on joint log-likelihood that enhances reasoning model performance without additional training.
Findings
PiCSAR improves accuracy by +10.18 points on MATH500 and +9.81 points on AIME2025.
It outperforms baselines with at least 2x fewer samples in 16 out of 20 comparisons.
Correct reasoning chains have higher confidence scores, validating PiCSAR's effectiveness.
Abstract
Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct…
| Method | SVAMP | GSM8K | MATH500 | GPQA-Diamond | TheoremQA | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| Gemma-2-9B-Instruct | ||||||||||
| Greedy Decoding | ||||||||||
| Self-Consistency | ||||||||||
| USC | - | - | - | - | - | |||||
| p(True) | ||||||||||
| Self-Certainty | ||||||||||
| PiCSAR | ||||||||||
| Upper Bound | ||||||||||
| Llama-3.1-8B-Instruct | ||||||||||
| Greedy Decoding | ||||||||||
| Self-Consistency | ||||||||||
| USC | - | - | - | - | - | |||||
| p(True) | ||||||||||
| Self-Certainty | ||||||||||
| PiCSAR | ||||||||||
| Upper Bound | ||||||||||
| Qwen3-8B (Non-thinking) | ||||||||||
| Greedy Decoding | 27.71 | |||||||||
| Self-Consistency | ||||||||||
| USC | - | - | - | - | - | |||||
| p(True) | ||||||||||
| Self-Certainty | ||||||||||
| PiCSAR | ||||||||||
| Upper Bound | ||||||||||
| Llama-3.1-70B-Instruct | ||||||||||
| Greedy Decoding | ||||||||||
| Self-Consistency | ||||||||||
| USC | - | - | - | - | - | |||||
| p(True) | ||||||||||
| Self-Certainty | ||||||||||
| PiCSAR | ||||||||||
| Upper Bound | ||||||||||
| Qwen3-32B (Non-thinking) | ||||||||||
| Greedy Decoding | ||||||||||
| Self-Consistency | ||||||||||
| USC | - | - | - | - | - | |||||
| p(True) | ||||||||||
| Self-Certainty | ||||||||||
| PiCSAR | ||||||||||
| Upper Bound | ||||||||||
| Method | SVAMP | GSM8K | MATH500 | GPQA-Diamond | TheoremQA | AIME 2024 | AIME 2025 |
|---|---|---|---|---|---|---|---|
| DS-Distill-llama-3-8B | |||||||
| Average | |||||||
| Self-Consistency | |||||||
| PiCSAR | |||||||
| Upper Bound | |||||||
| DS-Distill-Qwen-2.5-7B | |||||||
| Average | |||||||
| Self-Consistency | |||||||
| PiCSAR | |||||||
| Upper Bound | |||||||
| Qwen3-8B | |||||||
| Average | |||||||
| Self-Consistency | |||||||
| PiCSAR | |||||||
| Upper Bound | |||||||
| Rank | Peaks | Sent. | Ratio (%) | Acc. (%) |
|---|---|---|---|---|
| Llama-3.1-8B | ||||
| Highest | 1.88 | 16.4 | 14.8 | 53.3 |
| Middle | 2.00 | 22.9 | 12.8 | 48.8 |
| Lowest | 2.47 | 64.7 | 8.6 | 44.2 |
| Llama-3.1-70B | ||||
| Highest | 1.80 | 14.1 | 15.5 | 63.7 |
| Middle | 1.83 | 19.9 | 13.0 | 60.4 |
| Lowest | 3.08 | 38.4 | 10.8 | 59.4 |
| Qwen3-8B | ||||
| Highest | 1.99 | 15.8 | 17.6 | 73.7 |
| Middle | 1.91 | 17.6 | 17.0 | 72.8 |
| Lowest | 2.18 | 26.4 | 14.2 | 69.4 |
| Qwen3-32B | ||||
| Highest | 1.48 | 11.6 | 22.4 | 77.0 |
| Middle | 1.57 | 12.0 | 19.4 | 76.8 |
| Lowest | 1.76 | 25.1 | 16.1 | 72.6 |
| Gemma-2-9B | ||||
| Highest | 1.46 | 8.5 | 24.5 | 46.5 |
| Middle | 1.38 | 10.0 | 19.0 | 44.0 |
| Lowest | 1.20 | 11.6 | 14.3 | 41.6 |
| Method | SC | USC | Self-Cert. | PiCSAR |
|---|---|---|---|---|
| Full Reasoning Chain | ✓ | ✓ | ✓ | |
| Model Confidence | ✓ | ✓ | ✓ | |
| Computationally Efficient | ✓ | ×∗ | ✓ | ✓ |
| Smaller Model Capable | ✓ | ✓ | ✓ | |
| ∗Due to context length | ||||
| Prompt | Accuracy |
|---|---|
| Prompt 1 | 54.60% |
| Prompt 2 | 54.00% |
| Prompt 3 | 54.20% |
| Prompt 4 | 54.40% |
| Prompt 5 | 54.40% |
| Method | SVAMP | GSM8K | MATH500 | TheoremQA | ||||
|---|---|---|---|---|---|---|---|---|
| Gemma-2-9B-Instruct | ||||||||
| CISC (p(True)) | ||||||||
| PiCSAR | ||||||||
| CISC (PiCSAR) | ||||||||
| Upper Bound | ||||||||
| Llama-3.1-8B-Instruct | ||||||||
| CISC (p(True)) | ||||||||
| PiCSAR | ||||||||
| CISC (PiCSAR) | ||||||||
| Upper Bound | ||||||||
| Qwen3-8B (Non-thinking) | ||||||||
| CISC (p(True)) | ||||||||
| PiCSAR | ||||||||
| CISC (PiCSAR) | ||||||||
| Upper Bound | ||||||||
| Llama-3.1-70B-Instruct | ||||||||
| CISC (p(True)) | ||||||||
| PiCSAR | ||||||||
| CISC (PiCSAR) | ||||||||
| Upper Bound | ||||||||
| Qwen3-32B (Non-thinking) | ||||||||
| CISC (p(True)) | ||||||||
| PiCSAR | ||||||||
| CISC (PiCSAR) | ||||||||
| Upper Bound | ||||||||
| Method | SVAMP | GSM8K | MATH500 | GPQA-Diamond | ||||
|---|---|---|---|---|---|---|---|---|
| Gemma-2-9B-Instruct | ||||||||
| Reasoning Confidence | ||||||||
| Answer Confidence | ||||||||
| Reasoning Confidence (normalised) | ||||||||
| PiCSAR | ||||||||
| PiCSAR-N | ||||||||
| Upper Bound | ||||||||
| Llama-3.1-8B-Instruct | ||||||||
| Reasoning Confidence | ||||||||
| Answer Confidence | ||||||||
| Reasoning Confidence (normalised) | ||||||||
| PiCSAR | ||||||||
| PiCSAR-N | ||||||||
| Upper Bound | ||||||||
| Qwen3-8B (Non-thinking) | ||||||||
| Reasoning Confidence | ||||||||
| Answer Confidence | ||||||||
| Reasoning Confidence (normalised) | ||||||||
| PiCSAR | ||||||||
| PiCSAR-N | ||||||||
| Upper Bound | ||||||||
| Llama-3.1-70B-Instruct | ||||||||
| Reasoning Confidence | ||||||||
| Answer Confidence | ||||||||
| Reasoning Confidence (normalised) | ||||||||
| PiCSAR | ||||||||
| PiCSAR-N | ||||||||
| Upper Bound | ||||||||
| Qwen3-32B (Non-thinking) | ||||||||
| Reasoning Confidence | ||||||||
| Answer Confidence | ||||||||
| Reasoning Confidence (normalised) | ||||||||
| PiCSAR | ||||||||
| PiCSAR-N | ||||||||
| Upper Bound | ||||||||
| Method | AIME 2024 | AIME 2025 | MATH500 | SVAMP | GSM8K | GPQA-Diamond | |
|---|---|---|---|---|---|---|---|
| DS-Distill-llama-3-8B | |||||||
| Reasoning Confidence | |||||||
| Reasoning Confidence (Normalised) | |||||||
| Answer Confidence | |||||||
| PiCSAR | |||||||
| PiCSAR-N | |||||||
| Upper Bound | |||||||
| DS-Distill-Qwen-2.5-7B | |||||||
| Reasoning Confidence | |||||||
| Reasoning Confidence (Normalised) | |||||||
| Answer Confidence | |||||||
| PiCSAR | |||||||
| PiCSAR-N | |||||||
| Upper Bound | |||||||
| Qwen3-8B | |||||||
| Reasoning Confidence | |||||||
| Reasoning Confidence (Normalised) | |||||||
| Answer Confidence | |||||||
| PiCSAR | |||||||
| PiCSAR-N | |||||||
| Upper Bound | |||||||
| Accuracy | |
|---|---|
| 53.40% | |
| 53.40% | |
| 53.40% |
| Samples | PiCSAR (%) | Self-Consistency (%) |
|---|---|---|
| 6 | 89.11 | 88.15 |
| 10 | 89.89 | 88.56 |
| 16 | 89.89 | 88.11 |
| 32 | 90.22 | 88.89 |
| Model | Metric | -value | -stat | Mean (C / I) | ||
|---|---|---|---|---|---|---|
| LLaMA-3.1-8B | 4.57 | 38441 | 0.41 | / | ||
| 9.11 | 45115 | 0.82 | / | |||
| LLaMA-3.1-70B | 5.76 | 41596 | 0.54 | / | ||
| 6.99 | 39096 | 0.66 | / | |||
| Gemma-2-9B | 9.03 | 42086 | 0.81 | / | ||
| 9.03 | 45831 | 0.81 | / | |||
| Qwen3-8B | 5.37 | 36835 | 0.54 | / | ||
| 5.17 | 31131 | 0.52 | / | |||
| Qwen3-32B | 6.09 | 34500 | 0.64 | / | ||
| 4.98 | 27660 | 0.52 | / | |||
| Think-Qwen3-8B | 4.97 | 27177 | 0.56 | / | ||
| 2.67 | 21190 | 0.30 | / | |||
| Think-DS-R1 Distill-Qwen-7B | 3.87 | 29105 | 0.39 | / | ||
| 2.04 | 29023 | 0.21 | / | |||
| Think-DS-R1 Distill-LLaMA-8B | 5.99 | 39822 | 0.57 | / | ||
| 4.63 | 31908 | 0.44 | / |
| Method | Gemma-2-9B | Qwen3-8B | Llama-3.1-70B | DS-Qwen-7B | Average |
|---|---|---|---|---|---|
| Skywork-Reward-V2-Llama-3.1-8B | |||||
| LMUnit-qwen2.5-72B | |||||
| PiCSAR |
| Method | Gemma-2-9B | Qwen3-8B | Llama-3.1-70B | DS-Qwen-7B | Average |
|---|---|---|---|---|---|
| Skywork-Reward-V2-Llama-3.1-8B | |||||
| LMUnit-qwen2.5-72B | |||||
| PiCSAR |
PiCSAR: Probabilistic Confidence Selection
and Ranking for Reasoning Chains
Joshua Ong Jun Leang1,2 Zheng Zhao2 Aryo Pradipta Gema2 Sohee Yang3
Wai-Chung Kwan2 Xuanli He3 Wenda Li2 Pasquale Minervini2,4
Eleonora Giunchiglia1 Shay B. Cohen2
1Imperial College London 2University of Edinburgh 3UCL 4Miniml.AI
{j.ong25,e.giunchiglia}@imperial.ac.uk [email protected]
Abstract
Best-of-N sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. A key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. This method uses both the score of the reasoning path (reasoning confidence) and the score of the final answer (answer confidence). PiCSAR achieves substantial gains across several benchmarks ( on AIME2024, on AIME2025), outperforming baselines with at least 2x fewer samples in 20 out of 25 comparisons. Our analysis reveals that correct reasoning chains exhibit higher reasoning and answer confidence levels, justifying the effectiveness of PiCSAR. Code: https://github.com/joshuaongg21/PiCSAR.
1 Introduction
Recent studies have shown that LLMs achieve strong performance on complex reasoning tasks (Grattafiori et al., 2024; Team et al., 2024; Hurst et al., 2024). Techniques such as Chain-of-Thought (CoT; Wei et al., 2022; Kojima et al., 2022) aim to enhance the reasoning process by generating explicit intermediate reasoning steps. Building on these advances, large reasoning models (LRMs), LLMs that receive intensive reasoning‑focused post‑training, such as DeepSeek‑R1 (Guo et al., 2025) and Qwen3 (Yang et al., 2025a), solve complex problems by generating long CoT reasoning traces. These traces are often extended via test‑time scaling (Muennighoff et al., 2025) and can include reflective self‑checking (Yang et al., 2025b).
Despite these advances, classic decoding approaches such as greedy decoding often fall short of state-of-the-art performance on complex benchmarks (Team et al., 2025; Balunović et al., 2025), emphasising the need for more sophisticated inference-time strategies.
Best-of-N (BoN) sampling (Stiennon et al., 2020) emerged as an important technique, where N candidate responses are generated and the highest-scoring one is selected via a reward model (Mudgal et al., 2024; Huang et al., 2025). However, training external reward models can be computationally expensive (Wang et al., 2023a) and vulnerable to distribution shifts (Eisenstein et al., 2023).
This led to the adoption of simpler, training-free BoN variants, such as Self-Consistency (Wang et al., 2023b), which selects the most frequent answer among multiple generated outputs. However, a key limitation of Self-Consistency is its exclusive reliance on the final answer while ignoring the reasoning that leads to it. Extensions such as Universal Self-Consistency (USC; Chen et al., 2023b) prompt the model to identify the most consistent response from a set of candidates. However, USC focuses on majority agreement over full responses, overlooking reasoning-level signals critical to answer quality, such as coherence and plausibility. USC is further constrained by context-window size and reasoning ability (Chen et al., 2023b), proving particularly ineffective with smaller models (Kang et al., 2025). Attempts to overcome this by prompting the model to self-evaluate are often ineffective, as explicit confidence is often poorly calibrated (Miao et al., 2024; Taubenfeld et al., 2025).
To address these challenges, we introduce Probabilistic Confidence Selection And Ranking (PiCSAR), a probabilistic confidence method for selecting a reasoning chain together with its corresponding answer without requiring any additional training or fine-tuning. Our approach is straightforward to implement and can be used with any LLM or LRM as an inference-time tool. It is based on a new scoring function that, given a prompt $x$, selects a reasoning chain $r$ and the answer $y$ by maximising their joint conditional likelihood $p(r, y \mid x)$. This objective naturally separates into two complementary components. The reasoning confidence term promotes high-probability reasoning sequences by implicitly evaluating the likelihood of the chain given the prompt. The answer confidence term quantifies the model’s certainty in its final prediction, conditioned on the generated reasoning chain. Figure 2 shows a high-level outline of PiCSAR, and how it can solve instances that Self-Consistency and USC cannot solve correctly.
We evaluate PiCSAR on reasoning tasks across five LLMs and three LRMs, outperforming Self-Consistency and USC in most cases. PiCSAR achieves these gains with far fewer samples, often requiring only samples to beat baselines using samples. PiCSAR substantially improves LRM performance, with Deepseek-R1-distilled-Llama-3 gaining +13.33% and +7.58% over Self-Consistency on AIME2024 and GPQA-Diamond, respectively (Figure 1). Unlike USC, which is bounded by the model’s reasoning abilities, PiCSAR decouples confidence estimation, allowing smaller models to effectively capture stable reasoning process properties rather than model artefacts (§5.3).
Beyond empirical results, we provide a comprehensive analysis of LLM confidence behaviour. At finer granularity, we analyse answer confidence at the sentence level using information density, defined as the ratio of peak-confidence instances to sentence count (peak-to-sentence ratio), which measures how frequently a reasoning chain attains high confidence relative to its length. We find that higher accuracy correlates with high information density within model families (§5.1). In addition, we show that answer confidence positively correlates with downstream accuracy (§5.2).
2 A Joint Probabilistic Method for Reasoning Chain Selection
We propose a training-free method for selecting a reasoning chain from a set of candidates, grounded in a probabilistic framework that leverages the model’s confidence as its scoring signal. We frame the selection problem as an approximation of maximum a posteriori (MAP) decoding over the joint space of reasoning chains and final answers.
2.1 Scoring Function and Log-likelihood Decomposition
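We denote by $\mathcal{X}$ a set of possible prompts, $\mathcal{R}$ a set of reasoning chains, and $\mathcal{Y}$ the set of possible final answers. For a given input prompt $x \in \mathcal{X}$, our goal is to find a high-confidence reasoning chain $r \in \mathcal{R}$ and its corresponding answer $y \in \mathcal{Y}$. Consider a selection criterion that aims to identify the pair $(r, y)$ with the highest joint conditional probability, $p(r, y \mid x)$. By the chain rule of probability, this decomposes into two distinct components:
$$p(r, y \mid x) = p(r \mid x)\, p(y \mid r, x) \tag{1}$$
In log-space, the joint probability becomes the sum of two log-likelihood terms as follows:
$$\log p(r, y \mid x) = \underbrace{\log p(r \mid x)}_{\text{reasoning confidence}} + \underbrace{\log p(y \mid r, x)}_{\text{answer confidence}} \tag{2}$$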
These two terms provide complementary signals regarding the quality of a candidate generation:
- Reasoning Confidence ($\log p(r \mid x)$): quantifies the model’s confidence in generating the reasoning chain $r$ given the prompt $x$, that is, the plausibility of the reasoning path itself.
- Answer Confidence ($\log p(y \mid r, x)$): measures the model’s certainty in the answer $y$, conditioned on the prompt $x$ and the reasoning chain $r$ it has produced.
2.2 Probabilistic Confidence Selection And Ranking (PiCSAR)
Directly selecting $(r^*, y^*) = \arg\max_{(r, y)} \log p(r, y \mid x)$, where the joint log-likelihood score is maximised over the space of all possible pairs, is intractable. We therefore approximate this optimisation with our sampling-based PiCSAR approach, as outlined in Algorithm 1. Given the prompt $x$, we first generate $N$ candidate reasoning chains $r_1, \dots, r_N$ from the model’s posterior $p(r \mid x)$. Each chain $r_i$ implies a corresponding final answer $y_i$. We then re-rank these candidates using the PiCSAR scoring function.
The reasoning confidence term $\log p(r \mid x)$ is obtained by summing the token-level log-probabilities from the model during the generation of the reasoning chain $r$. Because no length normalisation is applied, this cumulative sum of token log-probabilities naturally favours more concise and direct reasoning paths. We also consider a length-normalised variant, PiCSAR-N, which scores the average log-probability per token instead of favouring concise reasoning paths and leads to similar results (details in Appendix C.3).
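The effect of length normalisation can be illustrated with a minimal sketch. The function name `reasoning_confidence`, the `normalise` flag, and the toy log-probabilities are assumptions made for illustration; they are not taken from the authors' code.

```python
def reasoning_confidence(token_logprobs, normalise=False):
    """Sum of token log-probabilities; optionally divided by the token count.

    normalise=False mirrors the unnormalised term described above,
    normalise=True mirrors a per-token average (as in a PiCSAR-N-style variant).
    """
    if not token_logprobs:
        raise ValueError("a reasoning chain needs at least one token")
    total = sum(token_logprobs)
    return total / len(token_logprobs) if normalise else total


# Toy values (hypothetical): a short chain and a longer chain whose
# individual tokens are slightly more likely.
short_chain = [-0.5] * 10
long_chain = [-0.4] * 40

# Unnormalised: the short chain wins because fewer negative terms accumulate.
prefers_short = reasoning_confidence(short_chain) > reasoning_confidence(long_chain)

# Normalised: the long chain wins because its per-token average is higher.
prefers_long = (
    reasoning_confidence(long_chain, normalise=True)
    > reasoning_confidence(short_chain, normalise=True)
)
```

In this toy case the two variants rank the chains in opposite order, which is the trade-off the paragraph describes.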
The answer confidence term, $\log p(y \mid r, x)$, however, presents a practical challenge. As the model’s distribution is over all possible text continuations, the probability of a final answer is confounded by the likelihood of whatever text might follow it. This makes the raw log-probabilities of different answers fundamentally incomparable. To address this and ensure we can reliably extract a final answer for answer confidence computation, we condition the model on an explicit instruction prompt, denoted as $I$, which is appended after the reasoning chain. This prompt explicitly asks the model to provide the final answer based on the preceding context (i.e., “When you see a potential reasoning followed by , output the final answer.”), with details of the prompt provided in Appendix B. While we extract the answer $y$ directly from the reasoning chain $r$, we use this augmented prompt to compute the answer confidence.
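Our modified objective, with $I$ denoting the instruction prompt appended after the reasoning chain and $(r_i, y_i)$ the $i$-th of $N$ sampled candidates for the prompt $x$, is thus:
$$(r^*, y^*) = \arg\max_{i \in \{1, \dots, N\}} \big[ \log p(r_i \mid x) + \log p(y_i \mid r_i, x, I) \big] \tag{3}$$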
Methodological Departure from Standard MAP Decoding. While the decomposition in Equation 3 relies on the foundational chain rule, PiCSAR fundamentally differs from standard Maximum A Posteriori (MAP) decoding or beam search. In standard continuous decoding, the joint probability of a CoT sequence is disproportionately dominated by the arbitrary length and local perplexity of the reasoning steps, effectively drowning out the signal of the final deductive answer. This limitation has historically driven the field away from likelihood-based scoring for reasoning tasks, favouring majority voting or externally trained reward models. By introducing the instructional intervention $I$, PiCSAR breaks this continuous autoregressive evaluation. It explicitly forces the model to evaluate the logical entailment of the answer independently of the generative probability of the preceding text. This isolates the conditional answer confidence, turning Equation 2 into Equation 3 and thereby yielding a robust, training-free ranking mechanism.
The final step is to select the candidate pair with the highest score. As illustrated in Figure 2, the two components of our scoring function play complementary roles. The reasoning confidence is the sum of log-probabilities for every token in the reasoning chain. Since these log-probabilities are negative, longer sequences tend to accumulate more negative values (i.e., larger magnitude), and can therefore dominate the overall score (see Appendix G). The answer confidence in turn serves as a discriminator, often proving decisive when multiple candidate chains exhibit similar reasoning plausibility.
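The selection step can be sketched as follows. This is a toy illustration rather than the authors' implementation: the `Candidate` container, the helper names, and all log-probability values are hypothetical, and in practice the log-probabilities would come from the language model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str                      # final answer extracted from the chain
    reasoning_logprobs: list[float]  # token log-probs of the reasoning chain
    answer_logprob: float            # log-prob of the answer after the instruction prompt


def joint_score(c: Candidate) -> float:
    """Reasoning confidence (sum of token log-probs) plus answer confidence."""
    return sum(c.reasoning_logprobs) + c.answer_logprob


def select_best(candidates: list[Candidate]) -> Candidate:
    """Pick the candidate with the highest joint log-likelihood score."""
    return max(candidates, key=joint_score)


# Hypothetical candidates: the majority answer is "41", but the single
# "42" chain has the highest joint score.
candidates = [
    Candidate("42", [-0.2, -0.3, -0.1], -0.05),
    Candidate("41", [-0.2, -0.3, -0.2], -1.20),
    Candidate("41", [-0.4, -0.5, -0.3], -0.90),
]

best = select_best(candidates)
majority_answer = Counter(c.answer for c in candidates).most_common(1)[0][0]
```

Here majority voting over final answers and joint-score selection disagree, which is the kind of instance where a confidence-based selector can differ from Self-Consistency.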
2.3 Confidence Information Plane
To motivate the design of PiCSAR, we analyse the distribution of model-generated samples on a 2D “Information Plane”, with respect to our two confidence terms (Figure 3). We partition the plane into four quadrants using the median value of each axis. A fallback value is used when the model fails to answer (i.e., when no answer token is generated and the answer-confidence term cannot be computed). We compared this fallback value with various other values; the results are in Appendix C.8. For Llama-3.1-8B on the MATH500 dataset, we see that correct answers (green) are concentrated in the upper-right quadrant (Q1), corresponding to high scores on both confidence terms.
The quadrant-wise accuracy breakdown is stark: the upper-right quadrant (Q1) achieves 71.7% accuracy, outperforming the other quadrants (Q2: 39.0%, Q3: 31.6%, Q4: 62.2%). High reasoning confidence (Q1 and Q4) leads to higher performance than high answer confidence (Q2 and Q3). This is reinforced by a statistical t-test showing that, while both terms are highly significant predictors of correctness, reasoning confidence is a significantly stronger predictor than answer confidence. For more details on the statistical tests, see Section E.1. Nevertheless, both confidence measures remain essential components for reasoning chain selection.
This principle can be used as a practical filter; tightening the thresholds to the 75th percentile, for instance, isolates a subset of samples with near-perfect accuracy (i.e., 100% on DS-Distilled-Qwen-2.5-7B with AIME2025), providing a mechanism to identify reliable instances (further examples in Appendix E). Overall, our analysis reveals that correct reasoning exhibits higher reasoning and answer confidence, with reasoning confidence being a substantially stronger predictor of correctness.
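The median-based quadrant split can be sketched as below. The quadrant numbering is an assumption: Q1 is the upper-right quadrant as stated in the text, and the remaining quadrants are numbered counter-clockwise with reasoning confidence on the horizontal axis and answer confidence on the vertical axis. The function name and the sample values are hypothetical.

```python
from statistics import median


def quadrant_accuracy(samples):
    """Per-quadrant accuracy for (reasoning_conf, answer_conf, is_correct) tuples.

    The plane is split at the median of each confidence term.
    """
    r_med = median(s[0] for s in samples)
    a_med = median(s[1] for s in samples)
    buckets = {"Q1": [], "Q2": [], "Q3": [], "Q4": []}
    for r_conf, a_conf, correct in samples:
        high_r, high_a = r_conf >= r_med, a_conf >= a_med
        if high_r and high_a:
            name = "Q1"   # high reasoning, high answer confidence
        elif high_a:
            name = "Q2"   # low reasoning, high answer confidence
        elif not high_r:
            name = "Q3"   # low on both terms
        else:
            name = "Q4"   # high reasoning, low answer confidence
        buckets[name].append(correct)
    return {q: (sum(v) / len(v) if v else None) for q, v in buckets.items()}


# Hypothetical samples: (reasoning confidence, answer confidence, correct?)
samples = [
    (-5.0, -0.1, True), (-6.0, -0.2, True),
    (-20.0, -0.1, False), (-25.0, -0.3, True),
    (-30.0, -2.0, False), (-28.0, -1.5, False),
    (-7.0, -1.0, True), (-8.0, -1.8, False),
]
accuracy_by_quadrant = quadrant_accuracy(samples)
```

Replacing the medians with a higher percentile (the text mentions the 75th) would turn the same split into a stricter reliability filter.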
3 Experimental Setup
Models.
We evaluate PiCSAR across a diverse set of recent LLMs and LRMs. Our experiments include LLMs from three major families: Llama-3.1-Instruct (8B and 70B; Dubey et al. 2024), Gemma-2-Instruct (9B; Team et al. 2024), and Qwen3 (8B and 32B; Yang et al. 2025a). For the Qwen3 models, we disable the thinking mode. For LRMs, we include two distilled models from the DeepSeek-R1 series (DS-Distill-Llama-3-8B and DS-Distill-Qwen-2.5-7B; Guo et al. 2025), and the Qwen3-8B model with thinking mode enabled. We exclude larger LRMs due to computational cost.
Baselines.
We compare against six baselines: Greedy Decoding (1); Self-Consistency (Wang et al., 2023b) (2); USC (Chen et al., 2023b) (3); p(True) (Kadavath et al., 2022) (4); Self-Certainty (Kang et al., 2025) (5). Confidence-Informed Self-Consistency (CISC; Taubenfeld et al. 2025) is discussed in Section C.1, as it involves weighted voting. While CISC was originally proposed using p(True), we also report CISC(PiCSAR) for a fair comparison. Due to context length limits and computational constraints, we exclude (3), (4), and (5) in LRMs and set in LLMs.
To isolate each component’s contribution in PiCSAR, we include three ablations in Sections C.2 and C.3: Reasoning Confidence ($\log p(r \mid x)$), with (6) and without (7) length normalisation, and Answer Confidence ($\log p(y \mid r, x)$) (8). For LRMs, we compare against (1), (2), (6), (7), (8). We also include the upper bound, representing the maximum achievable accuracy when at least one of the candidates is correct. Implementation details can be found in Appendix B.
Datasets.
We evaluate LLMs on three maths benchmarks: GSM8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021), MATH500 (Hendrycks et al., 2021), and two scientific reasoning benchmarks GPQA-Diamond (Rein et al., 2024), and TheoremQA (Chen et al., 2023a). We additionally evaluate LRMs on AIME 2024 and 2025, omitted for LLMs due to difficulty. Results are averaged over three runs and reported with standard errors.
4 Experimental Results
Performance on LLMs.
In Table 1, we see that with PiCSAR, Llama models show consistent improvements over all baselines. With on Llama-3.1-8B, PiCSAR outperforms the best-performing baseline (i.e., Self-Certainty) by 3.26% (26.54% → 29.80%) on GPQA-Diamond. On Llama-3.1-70B, PiCSAR shows similar gains: a 7.07% improvement over Self-Certainty and 5.66% over USC. We observe a similar trend on Gemma-2-9B; at , PiCSAR outperforms Self-Consistency by 4.93%. This outcome aligns with our information-plane analysis (see Figure 3); PiCSAR selects candidates in the top-right, high-accuracy quadrant by maximising the joint score of reasoning and answer confidence. For the Qwen family, PiCSAR generally leads across benchmarks and sample counts (). While there are a few exceptions, PiCSAR maintains the strongest overall profile. For instance, on MATH500 with , it improves the accuracy of Qwen3-32B from 75.93% (Self-Consistency) to 77.00%.
Our results show that PiCSAR outperforms most existing baselines across datasets, demonstrating consistent improvements across various reasoning tasks. As shown in Section C.1, CISC (PiCSAR) consistently outperforms CISC (p(True)), indicating its potential for weighting augmentation, but detailed voting strategy analysis remains future work. To verify the statistical significance of our results, we perform the Friedman test (Demšar, 2006), returning a p-value of , followed by the post-hoc Nemenyi test, which confirms that PiCSAR significantly outperforms all baselines (more in Appendix C.10).
*These findings validate our hypothesis that the model’s confidence provides more informative clues than frequency-based selection.*
PiCSAR is also sample efficient. PiCSAR with a small sampling budget () frequently outperforms both Self-Consistency and Self-Certainty at higher sampling budgets (), narrowing the gap to the upper bound by detecting correct reasoning even within a small sample. For instance, Gemma-2-9B Instruct with ( outperforms (). This indicates that correct reasoning chains are often present in small candidate sets, and that better selection is more important than increased sampling. See Appendix C.7 for details of the upper bound analysis.
Overall, the joint score acts as a paired scoring function: the reasoning confidence provides an assessment of plausibility towards its own reasoning, while the answer confidence, focused on the final answer, serves as a fine-grained discriminator. This approach yields consistent improvements across evaluated models.
Performance on LRMs.
Table 2 reports results for the LRMs. In 19 out of 21 comparisons, PiCSAR outperforms all baselines. Relative to Self-Consistency, DS-Distill-Llama-3-8B demonstrates substantial improvements on AIME2024 (8.89%) and AIME2025 (8.33%). DS-Distill-Qwen-2.5-7B shows greater improvements over Self-Consistency: 12.33% on AIME2024 and 12.78% on AIME2025. When applied to a relatively more capable model such as Qwen3-8B, PiCSAR increases accuracy by 4.1% and 3.33% on AIME 2024 and AIME 2025, respectively. While the previously evaluated benchmarks (MATH500, SVAMP, GSM8K) yield smaller gains, we observe substantial improvements on GPQA-Diamond: 5.21%, 7.58%, and 5.22% for DS-Distill-Llama-3-8B, DS-Distill-Qwen-2.5-7B, and Qwen3-8B, respectively. These trends mirror those observed with LLMs: gains are most pronounced on challenging datasets where the models’ initial baseline accuracies are relatively lower. The Friedman and post-hoc Nemenyi tests additionally confirm that PiCSAR significantly outperforms all baselines (see Appendix C.10).
*PiCSAR validates the information plane principle in §2.3 and provides a scoring method that improves accuracy both for LLMs and LRMs.*
Comparison with Trained Reward Models.
While our primary baselines consist of training-free BoN methods, a critical question is how PiCSAR compares to explicitly trained verifiers. To establish this, we benchmarked PiCSAR against top-performing reward models on the RewardBench (Lambert et al., 2025) leaderboard, specifically Skywork-Reward-V2-Llama-3.1-8B and LMUnit-qwen2.5-72B. Despite being a completely zero-shot, training-free method, PiCSAR achieves parity with, and in several cases, outperforms these heavily trained reward models across both MATH500 and GSM8K. This confirms that PiCSAR’s probabilistic formulation extracts a signal as reliable as explicit preference tuning, but at zero training cost. Detailed empirical results and analysis for this comparison are provided in Appendix E.2.
5 Further Analysis
In our analysis, we study (1) how information density correlates with accuracy; (2) the confidence-accuracy relationship within each model; (3) the robustness of our confidence metric when generation and evaluation are decoupled.
5.1 Sentence-Level Confidence Dynamics as a Proxy for Reasoning Quality
To understand the dynamics of PiCSAR, we analyse the evolution of answer confidence across reasoning chains. For a given reasoning chain $r$ composed of $T$ sentences $s_1, \dots, s_T$ and its corresponding final answer $y$, we measure how the model’s confidence in $y$ changes as it processes more of the reasoning. We compute a sequence of answer-confidence scores, $c_1, \dots, c_T$, one for each partial reasoning prefix $s_{1:t}$, where $t$ ranges from 1 to $T$. To capture the characteristics of these confidence sequences, we rank the responses by the PiCSAR scoring function into three groups (highest, middle, lowest), and analyse the “peakiness” of the confidence trajectory within each group. We define a peak as a sentence where the confidence exceeds the 95th percentile of all sentence-level scores observed across reasoning chains with the correct answer for that particular problem. The peak-to-sentence ratio is the peak count divided by the total number of sentences. We term this information density: the proportion of reasoning sentences contributing meaningfully to answer confidence.
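The peak-to-sentence ratio defined above can be computed with a short sketch. The helper names, the linear-interpolation percentile, and the toy scores are assumptions for illustration; the source does not specify which percentile estimator is used.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation between order statistics."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of an empty sequence is undefined")
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def peak_to_sentence_ratio(chain_scores, reference_scores, q=95):
    """Fraction of sentences whose answer confidence exceeds the q-th percentile.

    chain_scores: sentence-level answer-confidence scores for one chain.
    reference_scores: sentence-level scores pooled over the chains that reach
    the correct answer for the same problem.
    """
    threshold = percentile(reference_scores, q)
    peaks = sum(1 for score in chain_scores if score > threshold)
    return peaks / len(chain_scores)


# Hypothetical sentence-level log-probabilities.
reference_scores = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, -0.2, -0.1]
chain_scores = [-2.0, -1.0, -0.12, -0.1]

ratio = peak_to_sentence_ratio(chain_scores, reference_scores)
```

With these toy values the threshold falls between the two largest reference scores, so the last two sentences of the chain count as peaks and the ratio is 0.5.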
Table 3 shows: (1) Higher peak-to-sentence ratio aligns with higher accuracy across different models, showing that reasoning chains that lead to the correct answer tend to have higher information density. For instance, Llama-3.1-8B achieves 53.33% accuracy with a 14.75% ratio in the highest-scoring group, compared to 44.20% with only 8.58% in the lowest; (2) Longer reasoning chains do not necessarily improve accuracy. The lowest-ranked responses are substantially longer yet less accurate. For example, Llama-3.1-8B averages 64.72 sentences with 44.20% accuracy in the lowest group, versus 16.43 sentences with 53.33% accuracy in the highest group. This observation aligns with recent findings of inverse scaling in test-time compute (Chen et al., 2024; Wu et al., 2025; Hassid et al., 2025; Ghosal et al., 2025; Gema et al., 2025a), showing that solely extended reasoning length does not guarantee improved performance.
As unnormalised PiCSAR naturally rewards these high-density, convergent trajectories, it serves as our recommended default. Length normalisation (PiCSAR-N) is typically only necessary when evaluating weaker models that are highly prone to “verbose hallucinations”, where the model accumulates massive negative log-probabilities through unproductive, circular generation rather than meaningful reasoning. We provide a comprehensive sentence-level trajectory analysis detailing the exact criteria and decision boundary for enabling length normalisation in Appendix C.6 and Appendix D.
5.2 Intra-model Confidence Duality
In this section, we investigate the reliability of PiCSAR for predicting correctness within individual models (intra-model reliability analysis). We further examine whether these confidence scores remain comparable across different models (inter-model variance analysis) in Appendix J. We fit logistic regressions for the Qwen and Llama families (Figure 4), with correctness (correct/incorrect) as the dependent variable and the answer confidence score as the independent variable. This approach allows us to interpret the regression slope ($\beta$), which represents the incremental change in log-odds of correctness per unit increase in confidence score.
We find that the slope is consistently positive across all model sizes, consistent with prior findings (Huh et al., 2024; Goel et al., 2025) of a strong positive relationship between confidence scores and the likelihood of being correct. For example, Qwen3-14B shows a slope of 0.7255, implying that each unit increase in log-probability more than doubles the odds of correctness ($e^{0.7255} \approx 2.07$). The Point-Biserial Correlation Coefficient further confirms the positive relationship by measuring the linear association between binary correctness and continuous confidence. These findings show that PiCSAR serves as a reliable predictor of correctness within each model. See Appendix I for more details.
5.3 Confidence Portability: Decoupling Generation from Evaluation
Having established the properties of the confidence signal within a single model, we extend our analysis to multi-model scenarios, evaluating the robustness of the confidence signal when generation and evaluation are decoupled. This decoupling is motivated by practical system design, where one might use a costly API model for reasoning confidence, while relying on a smaller local model for answer confidence estimation. In this decoupled setting, the model that generates the reasoning chain (the generator) differs from the model that evaluates the answer confidence (the evaluator). The scoring function for a chain produced by the generator becomes:
[TABLE]
We test this by having one model generate the reasoning chains and various other models act as the evaluator. For LRMs, the base instruction-tuned model is used as the evaluator. Results in Figure 5 and Appendix A (Figure 6) show that overall accuracy remains largely unaffected under this decoupling, with only minor degradation even when the evaluator is a significantly smaller model than the generator. For instance, accuracy remains similar when the reasoning chain is generated by Llama-3.1-70B while the answer confidence is estimated with smaller models. This suggests that the answer confidence term is not merely a model-specific artefact but functions as a more portable measure of the logical entailment between a given reasoning chain and its conclusion, enabling flexible and computationally efficient answer confidence prediction.
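The decoupled selection rule can be illustrated with a minimal sketch. It assumes that per-token log-probabilities have already been obtained from each model; the function names, the dictionary layout, and the toy numbers are hypothetical and are not taken from the paper's implementation.

```python
from typing import Dict, List, Sequence


def decoupled_score(reasoning_logprobs: Sequence[float],
                    answer_logprobs: Sequence[float]) -> float:
    """Generator-side reasoning confidence plus evaluator-side answer confidence."""
    reasoning_conf = sum(reasoning_logprobs)  # token log-probs under the generator
    answer_conf = sum(answer_logprobs)        # token log-probs under the evaluator
    return reasoning_conf + answer_conf


def select_best(candidates: List[Dict[str, Sequence[float]]]) -> int:
    """Index of the candidate with the highest decoupled score."""
    scores = [decoupled_score(c["reasoning_logprobs"], c["answer_logprobs"])
              for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy log-probabilities, invented for illustration only.
candidates = [
    {"reasoning_logprobs": [-0.2, -0.1, -0.3], "answer_logprobs": [-0.05]},
    {"reasoning_logprobs": [-0.2, -0.1, -0.3], "answer_logprobs": [-1.20]},
    {"reasoning_logprobs": [-0.9, -0.8, -0.7], "answer_logprobs": [-0.01]},
]
best = select_best(candidates)
```

Because the two terms are simply added, the evaluator can be swapped without regenerating any reasoning chains; only the answer log-probabilities need to be recomputed.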
6 Related Work
LLM Reasoning. LLM reasoning abilities have improved significantly on complex tasks (Li et al., 2025; Muennighoff et al., 2025). While CoT reasoning improves performance (Wei et al., 2022; Leang et al., 2025a), subsequent work introduced hierarchical reasoning phases: multi-path exploration (Yao et al., 2023; Guan et al., 2025), step verification (Lightman et al., 2024; Leang et al., 2025b), and iterative refinement (Madaan et al., 2023). These techniques are computationally prohibitive for LRMs (Team et al., 2025; Yang et al., 2025a), which produce long, unstructured outputs.
BoN.
BoN is an alignment-via-inference method that optimises outputs with a scoring function (Charniak and Johnson, 2005; Stiennon et al., 2020; Amini et al., 2024). With inference-time scaling, LLMs benefit from generating multiple samples and selecting the best via reward models (Snell et al., 2024; Wu et al., 2024). Because reward models are costly to train, they are often replaced by training-free methods such as Self-Consistency and its variants (Wan et al., 2024; Lyu et al., 2025).
Sampling and Reranking.
Reranking improves generation quality (Adiwardana et al., 2020; Shen et al., 2021), often via trained verifiers to re-rank candidates, outperforming fine-tuning (Cobbe et al., 2021; Guan et al., 2025). Confidence estimation for re-ranking has been explored via sample agreement (Kuhn et al., 2023; Manakul et al., 2023; Tian et al., 2024; Simhi et al., 2025), or prompting models to verbalise confidence (Tian et al., 2023; Kadavath et al., 2022).
7 Conclusion
We introduced PiCSAR, a sample-efficient, training-free scoring function for BoN sampling that selects a reasoning chain by maximising a score decomposed into reasoning and answer confidence. PiCSAR yields consistent gains across models and datasets, narrowing the gap to oracle performance while outperforming baselines with at least 2x fewer samples in most comparisons. The answer confidence component can be estimated by a different model from the one used for generation, enabling flexible and efficient deployment. At the trajectory level, peak-count-to-sentence ratios correlate with accuracy, showing that reasoning chains leading to correct answers are more information-dense. Overall, PiCSAR offers a promising probabilistic confidence route to reasoning selection.
Limitations
PiCSAR targets domains with well-defined reasoning structures and definitive answers, such as mathematical and scientific problem-solving. We view this scope as both deliberate and essential: these domains represent a substantial class of high-value reasoning tasks where precision is important. Furthermore, restricting our analysis to these settings enables a rigorous evaluation of confidence calibration, a task that remains difficult in open-ended domains, which are often characterised by ambiguity and multiple valid solutions. This controlled environment allows us to validate the efficacy of model confidence as a selection metric without the confounding factors of subjective evaluation.
Extending PiCSAR to open-ended generation remains an important avenue for future research. To address the lack of definitive answer boundaries in such tasks, a promising direction is to augment the probabilistic framework with learned reward models for answer evaluation. We believe this adaptation could extend the reliability benefits of PiCSAR beyond fixed-format problems, offering a pathway toward robust reasoning in broader, general-purpose applications.
Acknowledgements
We thank the anonymous reviewers and area chairs for their helpful comments and feedback. We also thank Waylon Li and Adi Simhi for their valuable feedback. Lastly, we are grateful for the compute resources provided to us by the University of Edinburgh (Edinburgh International Data Facility), and UKRI (Isambard AI service, University of Bristol).
Appendix A Additional Results for Decoupled Confidence Estimation
In this section, we provide supplementary evidence that the decoupled confidence estimation introduced in §5.3 is portable across distinct evaluator models. This analysis aims to strengthen the claim that the answer-confidence term does not depend on the specific evaluator used.
Based on Figure 6(a), switching the evaluator model while holding the reasoning distribution fixed yields similar accuracy across datasets. This observation shows that the answer-confidence term is highly portable, allowing small-scale LLMs to reliably evaluate the reasoning chains of larger models.
When examining LRMs, we observe the same qualitative pattern (shown in Figure 6(b)), indicating that the phenomenon generalises across models. This reinforces the hypothesis that decoupled confidence estimation captures a stable property of the reasoning process itself, rather than an artefact of the evaluator model.
Appendix B Additional Implementation Details
Sampling and Decoding.
For sampling-based methods, we use reasoning traces for smaller models and for the larger Llama-3.1-70B and Qwen3-32B models, due to computational constraints. For all the models, we apply a hyperparameter of temperature and top-p . The greedy decoding (temperature , top-p ) baseline corresponds to , for which we report accuracy. For specialised LRMs, we use uniformly across all methods due to computational constraints. Since LRMs are not typically evaluated using greedy decoding, we follow the approach of Yang et al. (2025a), which is a temperature of 0.6, top-k of 20 and top-p , reporting the average accuracy across samples. For all our baselines except greedy decoding, we evaluate three times with the standard error reported. For LLMs, we cap the maximum token budget at 8,096 tokens. For LRMs, we follow the configuration of Yang et al. (2025a), using a maximum output length of 32,768 tokens, except for AIME’24 and AIME’25, where we extend the budget to 38,912 tokens to ensure sufficient reasoning space.
Baselines and Hyperparameters.
We compare PiCSAR against a range of decoding, confidence and re-ranking baselines.
- Greedy Decoding. As a deterministic decoding strategy, greedy decoding selects at each step the token with the highest conditional probability. Unlike greedy decoding, which selects a single high-probability continuation, PiCSAR evaluates multiple full reasoning trajectories and ranks them using joint reasoning-and-answer log-likelihood, enabling selection of the most globally probable chain.
- Self-Consistency (SC; Wang et al. 2023b). This method samples reasoning chains and aggregates predictions via majority voting on the final answer. In cases where multiple answers receive equal support, we break ties by selecting one at random. While SC relies purely on majority voting over final answers, PiCSAR incorporates the full reasoning chain’s token-level likelihood along with answer confidence, allowing it to prefer coherent but minority reasoning paths that SC would discard.
- Universal Self-Consistency (USC; Chen et al. 2023b). We include USC only for LLMs under sampling, as prompt and context length restrictions prevent its application in the LRM setting. We use the prompting strategy proposed in Chen et al. (2023b). Unlike USC, which asks the model to internally judge “consistency” among samples, PiCSAR uses a probabilistic, model-agnostic scoring function based directly on log-likelihoods of reasoning and answers, avoiding USC’s reliance on model self-evaluation and context-window limits.
- Self-Certainty (Kang et al., 2025). This method applies KL-divergence-based confidence scores, aggregated via Borda voting. It provides a probabilistic variant of self-consistency, where each candidate’s confidence distribution informs the re-ranking process. Instead of re-ranking chains with KL-based self-estimated correctness like Self-Certainty, PiCSAR scores each candidate through the true generative probabilities of its entire reasoning path and answer.
- P(True) (Kadavath et al., 2022). This method prompts the model to evaluate whether the answer or reasoning is True or False, then parses the probability of the response. While P(True) extracts a scalar correctness probability from a meta-prompt, PiCSAR leverages the actual likelihood structure of the model’s forward pass, combining reasoning and answer probabilities without relying on verbalised or poorly calibrated self-judgments.
- CISC (Taubenfeld et al., 2025). This method aggregates multiple sampled reasoning paths by weighting each path’s vote with the model’s own estimated correctness. For a fair comparison, we compare CISC using PiCSAR as the estimated correctness, termed CISC (PiCSAR), with CISC (P(True)), as originally proposed, in Appendix C.1.
We have summarised the novelty of PiCSAR against other baselines in Table 4.
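To make the contrast between the voting baselines concrete, the sketch below implements majority voting with random tie-breaking and a confidence-weighted vote in the spirit of CISC. The softmax weighting with a temperature is an assumption made for illustration; the exact normalisation used by CISC is not specified in this section, and all names and numbers are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def self_consistency(answers, rng=None):
    """Majority vote over final answers; ties are broken at random."""
    rng = rng or random.Random(0)
    counts = Counter(answers)
    top = max(counts.values())
    tied = sorted(a for a, c in counts.items() if c == top)
    return rng.choice(tied)


def weighted_vote(answers, scores, temperature=1.0):
    """Each sample votes for its answer with weight exp(score / T) (assumed form)."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


# Toy example: two low-confidence samples agree, one confident sample disagrees.
answers = ["15", "15", "12"]
scores = [-9.0, -8.5, -3.0]  # hypothetical joint log-likelihood scores
majority = self_consistency(answers)
weighted = weighted_vote(answers, scores)
```

In this toy case the majority vote returns the answer shared by the two low-confidence samples, while the weighted vote follows the single high-confidence sample, which is the kind of minority path that plain majority voting discards.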
Baseline Restrictions.
Due to context length constraints, USC can only handle a limited number of samples and is therefore evaluated exclusively in the LLM setting with , and excluded from all LRM experiments.
Ablations.
To disentangle the contributions of the two terms in our joint objective, we introduce single-term ablations. Reasoning Confidence ranks candidates solely by the log-likelihood of the reasoning chain, favouring plausible reasoning traces. Answer Confidence instead ranks by the log-likelihood of the final answer given the reasoning path, prioritising certainty in the final answer.
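The three rankers compared in the ablation can be sketched as follows, assuming each candidate carries its reasoning and answer token log-probabilities. The `mode` argument, the field names, and the toy values are hypothetical.

```python
def rank_candidates(candidates, mode="joint"):
    """Return candidate indices sorted from best to worst under the chosen score."""
    def score(c):
        reasoning = sum(c["reasoning_logprobs"])
        answer = sum(c["answer_logprobs"])
        if mode == "reasoning":   # Reasoning Confidence ablation
            return reasoning
        if mode == "answer":      # Answer Confidence ablation
            return answer
        if mode == "joint":       # full joint score
            return reasoning + answer
        raise ValueError(f"unknown mode: {mode}")
    return sorted(range(len(candidates)),
                  key=lambda i: score(candidates[i]), reverse=True)


# Toy candidates, invented for illustration.
cands = [
    {"reasoning_logprobs": [-0.5, -0.5], "answer_logprobs": [-2.0]},
    {"reasoning_logprobs": [-1.0, -1.0], "answer_logprobs": [-0.1]},
    {"reasoning_logprobs": [-0.8, -0.7], "answer_logprobs": [-0.2]},
]
by_reasoning = rank_candidates(cands, "reasoning")
by_answer = rank_candidates(cands, "answer")
by_joint = rank_candidates(cands, "joint")
```

Here each single-term ranker puts a different candidate first, and the joint score selects a third candidate that is reasonably strong on both terms.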
Framework and Hardware.
All experiments are conducted using the vLLM framework (Kwon et al., 2023) on 2–4 NVIDIA H100 GPUs (80GB). Results are reported as averages over independent evaluation runs to ensure robustness.
Prompt.
For the reasoning confidence generation, we utilise the following prompt:
You are a helpful AI Assistant that provides well-reasoned and detailed responses. Think step by step and provide the final answer in the form of ‘The final answer is: [answer]’. Decompose and break down your reasoning into smallest possible steps (Do not combine multiple inferences in one step), and do label your steps very clearly with ‘Step 1... \n\n Step 2... \n\n Step 3.... \n\n..... \n\n Step N-1..... \n\n Step N \n\n The final answer is: [answer]’.
For predicting answer confidence, we follow a method similar to Ton et al. (2024) but without training. Specifically, we use the following prompt template with 5-shot learning:
You are a helpful assistant. When you see a potential partial reasoning followed by ‘<sep>’, output the final answer.
B.1 Analysis of Prompts
To verify that the observed improvements are not attributable to the explicit instruction prompt (see (3)), we evaluated several alternative prompt formulations on the Llama-3.1-8B model. Using the MATH500 benchmark, we compared the resulting answer-confidence estimates across prompts.
Prompt 1: "You are a helpful assistant. When you see a potential partial reasoning followed by ’<sep>’, output the final answer. Here are some examples" + system_contents + "You are not allowed to provide any redundant symbols at for the final answer, including ’#’, ’/’, ’$’, ’**’ or others. Please only provide numbers as the final answer."
Prompt 2 (original prompt): "You are a helpful assistant. When you see a potential partial reasoning followed by ’<sep>’, output the final answer. Here are some examples"
Prompt 3: "You are a helpful assistant. By providing the partial reasoning, output the final answer directly without any additional texts."
Prompt 4: "You are a helpful assistant. Based on the reasoning provided, output the final answer directly without any additional texts. Only Provide the final answer."
Prompt 5: "You are a helpful assistant. Provide the final answer directly without any additional texts (only the final answer) based on the partial reasoning."
Our results in Table 5 show that changes in prompt phrasing have minimal influence on model performance. This suggests that, although the instructional content of a prompt remains essential for eliciting the final answer, the precise wording plays only a limited role in shaping the model’s behaviour.
Appendix C Further Experimental Results and Ablation Studies
C.1 Comparison between CISC (p(True)) and CISC (PiCSAR)
As shown in Table 6, PiCSAR performs strongly when used as the confidence weight in CISC’s weighted voting (Taubenfeld et al., 2025), consistently improving over the CISC (p(True)) baseline across all evaluated settings. This indicates that PiCSAR functions effectively both as a standalone selection mechanism and as an augmentation to existing weighting schemes. While these findings suggest a promising direction for performance optimisation, it lies beyond the scope of this work.
C.2 Component Analysis and Main Results Breakdown
In this section, we first provide a detailed breakdown of the experimental results for all methods, as summarised in Table 7, where we also show the performance of PiCSAR-N, a length-normalised variant of our primary method. We then present ablation studies on LRMs in Table 8. We compare three primary approaches: Reasoning Confidence, Answer Confidence, and our main method, PiCSAR (the joint probability).
Across the majority of benchmarks and model families presented in Table 7, we generally observe that PiCSAR outperforms its individual components. This pattern underscores the benefit of jointly considering the likelihood of both the reasoning process and the final answer. However, there are specific instances where relying solely on answer confidence, , achieves comparable or slightly better results (e.g., Gemma-2-9B and Qwen3-32B on GPQA-Diamond for ), highlighting that answer confidence remains a strong and competitive signal on its own.
C.3 Length-Normalised Variant: PiCSAR-N
As introduced in the main paper, we proposed a variant of our method, PiCSAR-N, which applies length normalisation to the reasoning confidence term. The scoring function for PiCSAR-N is defined as:
[TABLE]
where the normalising term is the number of tokens in the reasoning chain. This normalisation is intended to mitigate potential length bias, which might unfairly penalise longer reasoning paths.
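A minimal sketch of the two scoring variants, assuming that length normalisation divides the summed reasoning log-probability by the number of reasoning tokens and leaves the answer term unchanged, as the text describes. Function names and toy values are hypothetical.

```python
def picsar(reasoning_logprobs, answer_logprobs):
    """Unnormalised joint score: reasoning log-likelihood plus answer log-likelihood."""
    return sum(reasoning_logprobs) + sum(answer_logprobs)


def picsar_n(reasoning_logprobs, answer_logprobs):
    """Length-normalised variant: only the reasoning term is divided by its token count."""
    if not reasoning_logprobs:
        raise ValueError("reasoning chain must contain at least one token")
    return sum(reasoning_logprobs) / len(reasoning_logprobs) + sum(answer_logprobs)


# Toy chains: a short chain with less certain tokens and a long chain with
# more certain tokens, both ending in the same answer confidence.
short_chain = ([-0.5] * 4, [-0.3])
long_chain = ([-0.2] * 20, [-0.3])

prefers_short = picsar(*short_chain) > picsar(*long_chain)
prefers_long = picsar_n(*long_chain) > picsar_n(*short_chain)
```

The toy pair shows how the two variants can disagree: the unnormalised score favours the short chain because fewer tokens accumulate less negative log-probability, whereas the normalised score favours the long chain with higher per-token confidence.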
C.4 Analysis between Token Length, PiCSAR score, and Model Performance
Figure 7(a) shows that correct instances predominantly cluster in regions of high probability and short sequence length, indicating that concise reasoning is strongly associated with higher quality. This pattern is reinforced by Figure 7(b), which demonstrates a consistent decline in accuracy as sequence length grows. Together, the two figures highlight that shorter, more confident reasoning trajectories tend to yield more accurate performance.
C.5 Ablation Studies on LLMs and LRMs
The results for PiCSAR-N are included in Table 7 and Table 8. As shown, both PiCSAR and PiCSAR-N consistently surpass the other baselines, including their corresponding reasoning confidence metrics (with and without normalisation). The performance difference between PiCSAR and PiCSAR-N is not consistently in one direction; each variant excels on different model-dataset combinations. For instance, PiCSAR-N shows stronger performance with Gemma-2-9B on MATH500 () and GPQA-Diamond, whereas the non-normalised PiCSAR is clearly superior for Llama-3.1-8B across most settings. This suggests that the utility of length normalisation may depend on model-specific characteristics, such as tendencies towards verbosity.
Based on Table 7, PiCSAR-N outperforms the unnormalised PiCSAR in 20 of 40 results. With the comparison evenly split, length normalisation is not detrimental and does not consistently weaken PiCSAR, but neither does it consistently improve it.
We further conducted ablation studies on LRMs, with results reported in Table 8. Here, we compare PiCSAR and PiCSAR-N against both standard and normalised reasoning confidence, as well as answer confidence. The results confirm that our joint probability methods, PiCSAR and PiCSAR-N, consistently achieve top performance, similar to the findings with LLMs. Interestingly, we observe that maximising answer confidence alone yields strong results, sometimes comparable to PiCSAR, particularly on the DS-Distill-llama-3-8B model. This reinforces the value of the answer confidence signal while highlighting the general effectiveness of PiCSAR’s approach in combining both reasoning and answer confidence.
C.6 Further Analysis on Length-Normalised Variant: PiCSAR-N
In this section, we clarify the distinctions between PiCSAR and its length-normalised counterpart, PiCSAR-N, establishing empirical evidence for when length normalisation should be applied.
As shown in Table 7, the performance gap between PiCSAR and PiCSAR-N is generally marginal, with neither variant strictly dominating across all model-task configurations. We introduce PiCSAR-N primarily as an ablation to confirm that PiCSAR’s strong performance is not merely an artefact of systematically penalising longer generations. Therefore, we recommend the unnormalised PiCSAR as the default selection strategy. Our empirical analysis (see §5.1) suggests that correct reasoning chains exhibit high “information density”: they accumulate log-probability mass efficiently as they converge toward the final answer. The unnormalised joint log-likelihood naturally favours reasoning paths that are both highly probable and structurally concise, effectively penalising indirect reasoning paths.
Conversely, PiCSAR-N proves beneficial primarily for weaker models prone to “verbose hallucinations”, instances where a model generates locally plausible but logically stagnant text that accumulates massive negative log-probabilities strictly due to sequence length. For highly capable reasoners (e.g., Llama-3.1, Qwen3), the unnormalised score remains highly robust, as these models’ sequence-level log-probabilities serve as well-calibrated proxies for both logical coherence and problem-solving efficiency.
C.7 The Importance of Selection: Interpreting the Upper Bound
While PiCSAR consistently outperforms other heuristics, it necessarily falls short of the oracle Upper Bound, whose behaviour provides insight into the underlying challenges. On easier benchmarks such as SVAMP and GSM8K, the upper bound saturates quickly. For instance, increasing the sample size from to with Llama-3.1-70B on GSM8K raises accuracy only marginally from 96.91% to 97.44%, indicating that correct reasoning paths are usually present in small sample sets, and that selection rather than generation is the main bottleneck. In contrast, on more demanding tasks such as MATH500 and GPQA-Diamond, the upper bound continues to rise with larger , as seen with Gemma-2-9B on GPQA-Diamond where accuracy jumps from 55.22% to 82.49%, reflecting the intrinsic difficulty of generating correct answers. In both regimes, PiCSAR demonstrates its value: in selection-limited settings, it reliably identifies correct candidates from small pools, while in generation-limited scenarios, it narrows the gap to the oracle by detecting correct reasoning even when correct answers are sparse, highlighting that improving selection is often as important as enlarging the sampling budget.
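The oracle upper bound and the accuracy of a selection method can be computed from per-candidate correctness flags, as in the sketch below. It assumes the upper bound counts a question as solved when at least one sampled candidate is correct; the function names and the toy data are hypothetical.

```python
def oracle_upper_bound(correct_flags_per_question):
    """Fraction of questions for which at least one candidate is correct."""
    solved = sum(any(flags) for flags in correct_flags_per_question)
    return solved / len(correct_flags_per_question)


def selection_accuracy(correct_flags_per_question, chosen_indices):
    """Fraction of questions where the selected candidate is correct."""
    hits = sum(flags[i] for flags, i in zip(correct_flags_per_question, chosen_indices))
    return hits / len(correct_flags_per_question)


# Toy data: four questions with three candidates each.
flags = [
    [True, False, False],
    [False, False, True],
    [False, False, False],  # generation-limited: no candidate is correct
    [True, True, False],
]
chosen = [0, 0, 1, 1]       # indices picked by some selection method

upper = oracle_upper_bound(flags)
accuracy = selection_accuracy(flags, chosen)
gap = upper - accuracy
```

The gap between the two numbers is what a better selector can recover; questions with no correct candidate can only be recovered by sampling more or by a stronger generator.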
C.8 Analysis of Fallback Mechanism
To assess how sensitive our method is to the penalty assigned when a generation fails, i.e., no answer token is produced and the answer-confidence term cannot be calculated, we tested several fallback values for the Answer Confidence score (). Specifically, we compared our default setting of with more conservative penalties of and . As shown in Table 9, downstream accuracy is unchanged across all configurations. This indicates that, as long as the fallback value is sufficiently low to denote a failure state, its precise magnitude does not affect candidate rankings.
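The insensitivity to the fallback magnitude can be illustrated with a small sketch. The fallback values below are hypothetical placeholders rather than the settings used in the paper, and a missing answer is represented by `None`.

```python
def score_with_fallback(reasoning_conf, answer_conf, fallback):
    """Joint score; `fallback` replaces the answer term when no answer was produced."""
    return reasoning_conf + (fallback if answer_conf is None else answer_conf)


def best_index(candidates, fallback):
    scores = [score_with_fallback(r, a, fallback) for r, a in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy candidates as (reasoning confidence, answer confidence or None).
candidates = [(-12.0, -0.4), (-3.0, None), (-15.0, -0.1)]

pick_low = best_index(candidates, fallback=-100.0)     # hypothetical value
pick_lower = best_index(candidates, fallback=-1000.0)  # hypothetical value
pick_too_mild = best_index(candidates, fallback=-5.0)  # not low enough
```

Both sufficiently low fallbacks select the same valid candidate, while a fallback that is too mild lets the failed generation win on its reasoning term alone.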
C.9 Analysis of Performance with Number of Samples and Temperature
We first examine the scaling behavior of PiCSAR regarding the number of candidate generations (). We evaluate Gemma-2-9B on the SVAMP dataset with sample budgets ranging from to . As shown in Table 10, PiCSAR exhibits scaling properties, with accuracy consistently improving as the candidate pool expands (rising from 89.11% at to 90.22% at ). In contrast, Self-Consistency plateaus earlier and is consistently outperformed by PiCSAR. This indicates that PiCSAR is more effective at leveraging larger compute budgets to identify correct reasoning chains.
Additionally, we assess the stability of our method with respect to generation stochasticity by comparing performance at sampling temperatures of and . The results, summarized in Table 10, reveal negligible performance variance (89.89% vs. 89.67%). These results indicate that PiCSAR is robust to moderate changes in generation hyperparameters and maintains high precision even under more stochastic sampling conditions ().
C.10 Nemenyi Post-hoc Test for PiCSAR
In Figure 8 and Figure 9, we show the critical difference diagrams obtained from the Nemenyi post-hoc test. In each diagram, groups of methods that do not differ significantly (significance level 0.05) are connected by a horizontal line.
At , the Nemenyi post-hoc test shows that PiCSAR (average rank ) is significantly better than all other methods, as its rank is well separated beyond the critical difference from Greedy Decoding, Self-Consistency, Self-Certainty, and p(True). This indicates that PiCSAR consistently outperforms the alternatives across datasets, and its superior performance is statistically robust rather than due to random variation.
As for LRM, statistical analysis utilising the Friedman test revealed highly significant performance differences across the methods (). Subsequent Nemenyi post-hoc comparisons confirmed that PiCSAR significantly outperforms both the Average and Self-Consistency baselines, showing mean rank differences of and respectively, both of which substantially exceed the critical difference of at .
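For readers unfamiliar with the procedure, the sketch below computes average ranks across datasets and the Nemenyi critical difference, $CD = q_\alpha \sqrt{k(k+1)/(6N)}$, for $k$ methods and $N$ datasets. The critical value $q_\alpha$ is passed in by the caller; the value 2.343 used in the example is the commonly tabulated one for three methods at significance level 0.05. The accuracy matrix is invented for illustration and is unrelated to the paper's results.

```python
import math


def average_ranks(scores_by_dataset):
    """Average rank per method (rank 1 = best); tied methods share the mean rank."""
    k = len(scores_by_dataset[0])
    totals = [0.0] * k
    for row in scores_by_dataset:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1
            for t in range(i, j + 1):
                totals[order[t]] += mean_rank
            i = j + 1
    return [t / len(scores_by_dataset) for t in totals]


def nemenyi_cd(q_alpha, k, n):
    """Critical difference for k methods compared on n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


# Toy accuracies: rows are datasets, columns are methods.
rows = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.70, 0.75, 0.60],
    [0.90, 0.90, 0.50],
]
ranks = average_ranks(rows)
cd = nemenyi_cd(q_alpha=2.343, k=3, n=len(rows))
significant = abs(ranks[0] - ranks[2]) > cd
```

Two methods are declared significantly different when their average ranks differ by more than the critical difference, which is what the horizontal connectors in the diagrams encode.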
Appendix D Sentence Level Analysis
In this section, we provide a sentence-level analysis of PiCSAR and PiCSAR-N.
Our empirical analysis across AIME2025, AIME2024, and MATH500 with DeepSeek-R1-Distill-Qwen-2.5-7B demonstrates that PiCSAR’s sensitivity to length is well-calibrated and reflects genuine quality differences rather than arbitrary length penalisation. Excessively long reasoning chains (10K tokens) achieve only 14.1%, 13.5%, and 36.5% accuracy on these three benchmarks, compared to 92.7%, 82.1%, and 69.3% for the 1K–5K token range. Wrong answers are consistently 2–3× longer than correct ones (e.g., 15,761 vs. 5,651 tokens on AIME2025), suggesting that excessive length may signal model uncertainty.
We further conducted a sentence-level trajectory analysis across reasoning chains for Qwen3-8B.
On AIME2024 (thinking mode):
- •
The highest-PiCSAR generations (avg 650 sentences) begin with around at early sentence positions and rise steadily, peaking near — demonstrating that each reasoning step productively advances toward the correct answer.
- •
The lowest-PiCSAR generations (avg 1,700 sentences, 2 longer) have oscillating erratically between and throughout, never reaching the sharp convergent peak observed in high-PiCSAR samples.
On MATH500 (no-think mode), the same pattern holds:
- •
High-PiCSAR generations (avg 660 sentences) show climbing monotonically from to a peak near .
- •
Low-PiCSAR generations (avg 1,057 sentences, 1.6 longer) rise more slowly, plateau at substantially worse values (), and exhibit noisy fluctuations rather than clean convergence.
These trajectory analyses demonstrate that the additional length in low-PiCSAR generations does not yield meaningful progress in reasoning tasks: the model cycles through unproductive loops without converging. PiCSAR’s unnormalised score faithfully captures this distinction by encoding the model’s own confidence in a reasoning path, naturally assigning lower scores to long, uncertain chains. However, as shown in the example above regarding Qwen3-8B on AIME2024 and MATH500, when each generation converges to the answer, accuracy increases with the length. This length sensitivity is therefore a desirable property that rewards efficient, convergent reasoning over verbose, aimless generation.
We further provide three cases in Appendix H: (1) a lengthy generation that arrives at the correct answer through extended thinking; (2) a concise generation with high information density that likewise yields the correct answer; and (3) the impact of answer confidence on generation selection.
Appendix E Additional Experiments for Confidence Information Plane
In this section, we show results for all models across datasets of varying difficulty (GSM8K, MATH500 and AIME2024). We observe a consistent pattern for PiCSAR. In addition, the utility of our confidence metric extends to filtering for high-reliability answers. For GSM8K and MATH500, we use the median as our threshold with outliers removed, similar to §2.3. For AIME2024, however, as the instances are similar, we include all instances, including the outliers, and set the threshold to 60% for both the x- and y-axes. We show results on GSM8K in Figures 10–14, on MATH500 in Figures 15–19, and on AIME2024 in Figures 20–23. We also show results using the 75th percentile as the threshold in Figure 24: increasing the confidence thresholds from the median to the 75th percentile isolates a region in the Information Plane with significantly higher accuracy, effectively identifying the most trustworthy solutions.
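The quadrant analysis can be sketched as follows, under the assumption that the two axes of the Information Plane are reasoning confidence and answer confidence and that each axis is split at a threshold computed from the data (the median by default). The function name, the tuple layout, and the toy samples are hypothetical.

```python
from statistics import median


def quadrant_accuracy(samples, threshold_fn=median):
    """Accuracy per quadrant; samples are (reasoning_conf, answer_conf, is_correct)."""
    tx = threshold_fn([s[0] for s in samples])
    ty = threshold_fn([s[1] for s in samples])
    buckets = {}
    for reasoning, answer, ok in samples:
        key = ("high" if reasoning >= tx else "low",
               "high" if answer >= ty else "low")
        hits, total = buckets.get(key, (0, 0))
        buckets[key] = (hits + int(ok), total + 1)
    return {key: hits / total for key, (hits, total) in buckets.items()}


# Toy samples, invented for illustration.
samples = [
    (-2.0, -0.10, True),
    (-3.0, -0.20, True),
    (-10.0, -0.15, False),
    (-12.0, -2.00, False),
    (-2.5, -1.50, True),
    (-11.0, -1.80, False),
]
acc = quadrant_accuracy(samples)
```

Passing a different `threshold_fn`, for example one that returns the 75th percentile, narrows the high-confidence quadrant in the way described above.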
E.1 Statistical Tests
In this part, we present detailed results of the statistical tests described in §2.3. We conduct these tests on the MATH500 dataset, with all results reported in Table 11.
The terms reported in the table correspond to reasoning confidence and answer confidence. The t-statistic measures the degree of separation between correct and incorrect responses. The associated p-value confirms that both confidence metrics significantly contribute to this distinction rather than arising from random variation. The U-statistic provides a non-parametric validation that the scores for correct responses are stochastically distinct from those for incorrect ones. The Cohen’s d quantifies the magnitude of this effect size. The mean (C/I) indicates the directionality of the relationship, showing that the average log-probabilities are consistently higher for correct responses than for incorrect ones.
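The statistics named above can be computed with the standard library alone, as in the sketch below. It assumes Welch's unequal-variance form for the t-statistic and a pooled standard deviation for Cohen's d; the section does not state which variants were used, and the toy scores are invented.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t-statistic for two independent samples (assumed variant)."""
    return (mean(x) - mean(y)) / math.sqrt(variance(x) / len(x) + variance(y) / len(y))


def cohens_d(x, y):
    """Cohen's d with a pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled = math.sqrt(((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2))
    return (mean(x) - mean(y)) / pooled


def mann_whitney_u(x, y):
    """U statistic for x: pairs with x_i > y_j count 1, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


# Toy log-probability scores for correct and incorrect responses.
correct = [-1.0, -2.0, -3.0]
incorrect = [-4.0, -5.0, -6.0]

t_stat = welch_t(correct, incorrect)
d = cohens_d(correct, incorrect)
u = mann_whitney_u(correct, incorrect)
```

A positive t-statistic and effect size here mean that correct responses have higher (less negative) average log-probability, matching the direction reported by the mean (C/I) column.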
To test whether the ability of the base model itself influences the performance of PiCSAR, we perform a one-way ANOVA on whether the choice of base model affects PiCSAR’s improvement over the best competing baseline across all benchmarks. The result indicates no statistically significant difference, suggesting that PiCSAR provides consistent gains regardless of the underlying model. This is supported by our findings in §5.3, where we show that the answer-confidence component can be evaluated by a relatively smaller model without degrading selection quality, removing the requirement for the scoring model to match the generator in scale or architecture.
E.2 Comparison between PiCSAR and Trained Verifiable Rewards
In this section, we compare PiCSAR with trained verifiable rewards. We evaluate PiCSAR against two top-performing reward models from RewardBench (Lambert et al., 2025): Skywork-Reward-V2-Llama-3.1-8B and LMUnit-qwen2.5-72B.
Based on Tables 12 and 13, PiCSAR achieves parity with (and occasionally outperforms) Skywork-Reward-V2-Llama-3.1-8B, a model explicitly trained on massive preference datasets, despite PiCSAR being zero-shot and training-free. While larger RMs (i.e., LMUnit-qwen2.5-72B) generally perform better (avg +1-2%), the fact that PiCSAR is competitive with trained verifiers confirms its effectiveness, especially given that it incurs no additional training cost.
Appendix F Dataset Details
- SVAMP (Patel et al., 2021): https://github.com/arkilpatel/SVAMP, License: SVAMP License
- GSM8K (Cobbe et al., 2021): https://huggingface.co/datasets/openai/gsm8k, License: GSM8K License
- MATH (Hendrycks et al., 2021): https://huggingface.co/datasets/HuggingFaceH4/MATH-500, License: MATH License
- GPQA-Diamond (Rein et al., 2024): https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond/train, License: GPQA License
- TheoremQA (Chen et al., 2023a): https://huggingface.co/datasets/TIGER-Lab/TheoremQA, License: GPQA License
- AIME-2024: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024, License: AIME-2024 License
- AIME-2025: https://huggingface.co/datasets/opencompass/AIME2025, License: AIME-2024 License
[FIGURES: Confidence Information Plane plots for GSM8K, MATH500, AIME2024, and AIME2024 with various quadrants]
Appendix G Example of PiCSAR Scoring
Appendix H Further Case Study of PiCSAR Scoring
H.1 Case 1 & 2
H.2 Case 3
Appendix I Intra-model Reliability
To support the intra-model results in Section 5.2, we analyse the calibration of PiCSAR’s confidence signal using the evaluation traces collected for the Qwen3 family. For every sample we pair the answer log-probability with its correctness label and fit a separate model per backbone. The resulting calibration curves in Figure 4 exhibit a consistent monotonic trend: the logistic slopes are , , and for Qwen3-8B, 14B, and 32B respectively, and the corresponding point-biserial coefficients (, , ) show a positive correlation between higher confidence and the probability of a correct answer.
Figure 28 also shows how this effect manifests in the raw score distribution. Correct solutions concentrate around higher confidence values (closer to zero log-probability), whereas incorrect ones remain several nats lower, leaving limited overlap in the high-confidence region.
I.1 Logistic Regression Experimental Training
We model the relationship between confidence and correctness using logistic regression, similar to Gema et al. (2025b). The binary outcome variable encodes whether the final answer is correct, while the predictor is the model’s confidence score expressed as the log-probability of the final answer:
[TABLE]
where is the sigmoid function. The regression coefficient quantifies the change in log-odds of correctness per unit change in confidence. A positive indicates that higher confidence increases the likelihood of correctness. For instance, as shown in Figure 28(b), in Qwen3-14B, corresponds to more than doubling the odds of correctness ().
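As an illustration of this kind of fit, the following is a minimal sketch of a one-predictor logistic regression of correctness on answer log-probability, solved with Newton's method using only the Python standard library. The function names (`sigmoid`, `fit_logistic`) and the toy `conf`/`correct` values are assumptions made for this example; they are not the paper's code or data.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(conf, correct, iters=50, tol=1e-10):
    """Fit P(correct = 1 | conf) = sigmoid(b0 + b1 * conf) by Newton's method.

    Assumes the data are not perfectly separable, so the maximum-likelihood
    estimate is finite.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(conf, correct):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # entries of the (negated) Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step: H^{-1} g
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical toy data: answer log-probabilities and correctness labels.
conf = [-0.1, -0.3, -0.5, -0.8, -1.2, -1.5, -2.0, -2.6, -3.1, -3.9]
correct = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]

b0, b1 = fit_logistic(conf, correct)
odds_ratio = math.exp(b1)  # multiplicative change in odds per +1 nat of confidence
print(f"slope={b1:.3f}, odds ratio per nat={odds_ratio:.3f}")
```

Exponentiating the slope gives the odds ratio per unit of confidence, which is the quantity behind the "doubling the odds" reading of the coefficient.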
I.2 Point-biserial Correlation Coefficient
As a complementary measure to logistic regression, we compute the point-biserial correlation coefficient between confidence scores (continuous) and correctness (binary). This statistic, mathematically equivalent to Pearson's correlation with a dichotomous variable, directly quantifies the strength of association between the two. It is defined as
$$r_{pb} = \frac{\bar{c}_1 - \bar{c}_0}{s} \sqrt{\frac{n_1 n_0}{n^2}},$$
where $\bar{c}_1$ and $\bar{c}_0$ denote the mean confidence scores for correct and incorrect samples, $s$ is the pooled standard deviation (computed over all $n = n_1 + n_0$ samples), and $n_1$ and $n_0$ are the respective sample counts. The coefficient is bounded in $[-1, 1]$, with positive values indicating alignment between confidence and correctness. For instance, an $r_{pb}$ of 0.35 for Qwen3-14B indicates a moderate positive association. Together with logistic regression, this provides a scale-free validation that confidence is a consistent predictor of correctness within a given model.
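The equivalence with Pearson's correlation can be checked numerically. The sketch below computes the point-biserial coefficient from group means, the standard deviation of all scores, and group counts, then compares it with a direct Pearson correlation against the 0/1 labels. The helper names and the toy values are assumptions for illustration only.

```python
import math

def point_biserial(scores, labels):
    """Point-biserial correlation between continuous scores and 0/1 labels."""
    ones = [s for s, y in zip(scores, labels) if y == 1]
    zeros = [s for s, y in zip(scores, labels) if y == 0]
    n1, n0 = len(ones), len(zeros)
    n = n1 + n0
    m1 = sum(ones) / n1
    m0 = sum(zeros) / n0
    mean = sum(scores) / n
    # Population standard deviation over all samples (both groups combined).
    s = math.sqrt(sum((v - mean) ** 2 for v in scores) / n)
    return (m1 - m0) / s * math.sqrt(n1 * n0 / (n * n))

def pearson(xs, ys):
    """Plain Pearson correlation, used here only as a cross-check."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical toy data: answer log-probabilities and correctness labels.
scores = [-0.1, -0.3, -0.5, -0.8, -1.2, -1.5, -2.0, -2.6, -3.1, -3.9]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]

r_pb = point_biserial(scores, labels)
r_pearson = pearson(scores, labels)
print(f"r_pb={r_pb:.4f}, pearson={r_pearson:.4f}")
```

Note that the identity holds when the standard deviation is taken over all samples with a divisor of n; a within-group pooled estimate would give a different statistic.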
Appendix J Inter-model Variance
Inter-model variance analysis challenges the assumption that confidence scores are a universal measure of correctness across models. While intra-model reliability holds across model sizes and architectures, confidence scores cannot be compared across models of different sizes or architectures. As shown in Figure 29, the Llama family exhibits a predictable trend: both accuracy and confidence increase with model size. In contrast, the Qwen family shows a non-monotonic relationship: Qwen3-1.7B achieves the highest confidence while showing the lowest accuracy. This difference implies that, although confidence is a useful proxy for selecting an accurate reasoning path from a set of candidates within a model, its absolute value is model-specific and not comparable across models.
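In practice, this suggests ranking candidates only against other candidates from the same model. The following minimal sketch illustrates that usage pattern; the model names, answers, and confidence values are made up for the example and carry no empirical meaning.

```python
def select_within_model(candidates):
    """Return the candidate with the highest confidence among one model's samples."""
    return max(candidates, key=lambda c: c["confidence"])

# Hypothetical candidates; confidence is a log-probability (higher is more confident).
samples_by_model = {
    "model_a": [
        {"answer": "17", "confidence": -6.2},
        {"answer": "19", "confidence": -4.1},
    ],
    "model_b": [
        {"answer": "23", "confidence": -1.3},
        {"answer": "19", "confidence": -2.7},
    ],
}

# Rank within each model separately; raw scores are not pooled across models,
# because their scales are model-specific.
selected = {name: select_within_model(cands)
            for name, cands in samples_by_model.items()}

for name, cand in selected.items():
    print(name, cand["answer"], cand["confidence"])
```

Pooling all candidates and taking a single maximum would implicitly treat the two models' score scales as comparable, which is the assumption the analysis above argues against.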
References
- Adiwardana et al. (2020) Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and 1 others. 2020. Towards a human-like open-domain chatbot. arXiv preprint, abs/2001.09977.
- Amini et al. (2024) Afra Amini, Tim Vieira, Elliott Ash, and Ryan Cotterell. 2024. Variational best-of-n alignment. arXiv preprint, abs/2407.06057.
- Balunović et al. (2025) Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. 2025. Matharena: Evaluating LLMs on uncontaminated math competitions. arXiv preprint, abs/2505.23281.
- Charniak and Johnson (2005) Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 173–180, Ann Arbor, Michigan. Association for Computational Linguistics.
- Chen et al. (2023a) Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. 2023a. TheoremQA: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7889–7901, Singapore. Association for Computational Linguistics.
- Chen et al. (2024) Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, and 1 others. 2024. Do not think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv preprint, abs/2412.21187.
- Chen et al. (2023b) Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. 2023b. Universal self-consistency for large language model generation. arXiv preprint, abs/2311.17311.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint, abs/2110.14168.
