Training-free LLM Verification via Recycling Few-shot Examples

Dongseok Lee; Jimyung Hong; Dongyoung Kim; Jaehyung Kim

arXiv:2506.17251·cs.LG·October 2, 2025

Open Access · 3 Reviews

TL;DR

The paper introduces ReFeri, a training-free framework that recycles few-shot examples to verify and select the most accurate LLM outputs, significantly enhancing performance without extra training.

Contribution

ReFeri uniquely uses few-shot examples to evaluate and verify LLM outputs, improving accuracy through a novel scoring method without additional training.

Findings

01. Achieves an average of 4.8% accuracy gain across tasks

02. Effective response selection improves LLM performance

03. Works with multiple LLMs and diverse tasks

Abstract

Although LLMs have achieved remarkable performance, the inherent stochasticity of their reasoning process and their varying conclusions present significant challenges. Majority voting or Best-of-N with external verification models has been explored to find the most promising solution among multiple LLM outputs. However, these approaches have certain limitations, such as limited applicability or the cost of an additional training step. To address this problem, we propose a novel and effective framework that Recycles Few-shot examples to verify LLM outputs (ReFeri). Our key idea is to use the given few-shot examples not only to generate outputs, as in the conventional few-shot prompting setup, but also to evaluate the candidate outputs for the target query. Specifically, ReFeri evaluates the generated outputs by combining two different scores, whose design is motivated by Bayes' rule,…
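
As a rough illustration of the selection step the abstract describes, the sketch below picks one candidate out of N by combining two per-candidate scores. The abstract is cut off before the scores are defined, so everything concrete here is an assumption: the field names `forward_score` and `backward_score`, the use of log-probabilities, and the weighted sum are hypothetical stand-ins, not the authors' formulation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    # Hypothetical: e.g. log-probability of the candidate given the
    # few-shot examples and the query.
    forward_score: float
    # Hypothetical: e.g. log-probability of the few-shot examples when
    # the candidate is used as a demonstration.
    backward_score: float


def select_candidate(candidates, weight=1.0):
    """Return the candidate with the highest combined score.

    The additive combination is an assumption made for illustration only.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(
        candidates,
        key=lambda c: c.forward_score + weight * c.backward_score,
    )


# Made-up scores, purely to exercise the function.
candidates = [
    Candidate("answer A", forward_score=-1.2, backward_score=-3.0),
    Candidate("answer B", forward_score=-1.5, backward_score=-1.0),
    Candidate("answer C", forward_score=-0.9, backward_score=-4.5),
]
best = select_candidate(candidates)
print(best.text)
```

With these made-up numbers the forward score alone would favor "answer C", while adding the second score shifts the choice to "answer B", which is the kind of reranking effect a second signal can have.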

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Paper is well-written and easy to follow.
- The authors motivate their proposal well and support the effectiveness of their method through various additional experiments.
- Even though the confidence score calculation and usage already exist in the literature and do not provide much to the community, I liked the incorporation of the backward confidence score.

Weaknesses

- A recent related work, which also uses uncertainty for Best-of-N selection, is missing from the evaluations and related work: Zhewei Kang et al., Scalable Best-of-N Selection for Large Language Models via Self-Certainty, 2025.
- I think one simple and crucial baseline is missing: self-consistency (majority voting), i.e., selecting the most common answer across sampled generations. This is a stronger baseline than random selection.
- Seeing the golden performance would be helpful for better analysi
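
The self-consistency baseline the reviewer describes can be sketched as follows. This is a minimal illustration, assuming the final answer has already been extracted from each sampled generation as a normalized string; the function name `majority_vote` is hypothetical and does not come from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled generations.

    Ties go to the answer that appeared first, because Counter.most_common
    keeps first-encountered order for elements with equal counts.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    return counts.most_common(1)[0][0]


# Five sampled generations for one query, reduced to their final answers.
sampled = ["42", "41", "42", "40", "42"]
print(majority_vote(sampled))
```

Note that this baseline only applies when answers can be compared for equality, which is one reason it does not transfer directly to open-ended generation.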

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Test-time scaling and verification of LLM outputs are crucial to improve LLM performance in complex tasks such as reasoning. Therefore, advances in LLM output verification are definitely called for.
- The proposed method is very efficient in the sense that it does not require additional training (as is usually the case for RPMs).

Weaknesses

- While the forward confidence score is fairly straightforward, I found the presentation of the backward confidence score a bit confusing. Perhaps an illustrative example would help to better convey the technical details.
- The performance measures reported in Table 1 lack standard deviations.
- The novelty of the proposed approach is fairly limited; it is basically a simple combination of sequence probabilities that seems to work in practice on a selection of datasets. The paper does not prov

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The core idea of "recycling" few-shot examples for verification, not just generation, is a novel perspective.
2. The experimental design is a major strength. The use of seven diverse benchmarks and three LLMs provides compelling evidence for the method's generalizability.

Weaknesses

1. The proposed method is a re-interpretation of likelihood-based reranking and LLM-as-judge techniques, both of which already use internal model probabilities or few-shot conditioning. The paper fails to clearly delineate how ReFeri fundamentally differs from prior “Best-of-N” and likelihood-based methods (e.g., CoT-WP, PRM, ORM, or self-consistency with verification). The novelty is slightly diminished by the existence of methods like CoT-WP, which already use the forward score. The primary in
