Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

Marc Boubnovski Martell; Josefa Lia Stoisser; Kaspar Märtens; Jialin Yu; Robert Kitchen; Philip Torr; Jesper Ferkinghoff-Borg

arXiv:2605.06308 · cs.AI · May 8, 2026

TL;DR

This paper introduces a geometry-based black-box confidence scoring method for chain-of-thought reasoning, improving over self-consistency by leveraging trajectory convergence without requiring logits or supervised calibration.

Contribution

The paper proposes a trajectory-confidence score that embeds a chain-of-thought as a sliding-window trajectory and measures its convergence to external answer anchors. The score needs no logits, hidden states, or supervised calibrators.

Findings

01

The fused score at K=4 outperforms self-consistency at K=8 in 6/6 (benchmark, reasoner) settings (median AUC 0.78 vs 0.71).

02

Geometry peaks in the penultimate window and inverts at the terminal window on GPQA Diamond.

03

Fusing the trajectory score with the coverage and verbalized-confidence channels yields Pareto improvements over self-consistency.

Abstract

Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry of the trace. We propose a black-box trajectory-confidence score: we embed a CoT as a sliding-window trajectory and measure its convergence to external answer anchors with a one-parameter softmax. The method needs no logits, hidden states, or supervised calibrators. Across six (benchmark, reasoner) settings on MedQA-USMLE, GPQA Diamond, and MMLU-Pro with Gemini 3.1 Pro and Claude Sonnet 4.6, fusing this score with coverage and verbalized-confidence channels at K=4 yields Pareto improvements over self-consistency at K=8 in 6/6 settings (median AUC 0.78 vs 0.71, ΔAUC = +0.075). A fixed-pick control (+0.060) and E5 cross-embedder replication rule out answer…
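
The idea of a sliding-window trajectory scored against answer anchors can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the window size, stride, the inverse temperature `beta` (standing in for the single softmax parameter), and the helper names `toy_embed`, `sliding_windows`, and `trajectory_confidence` are all assumptions made for illustration. The hashed bag-of-words embedder is a hypothetical stand-in for a real text embedder.

```python
import hashlib
import numpy as np

def toy_embed(text, dim=64):
    """Hypothetical stand-in for a real embedder: hashed bag of words, L2-normalised."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def sliding_windows(tokens, size, stride):
    """Split a token list into overlapping windows (a short tail may be left uncovered)."""
    last_start = max(len(tokens) - size, 0)
    return [" ".join(tokens[i:i + size]) for i in range(0, last_start + 1, stride)]

def trajectory_confidence(cot, anchors, picked, beta=10.0, size=8, stride=4,
                          embed=toy_embed):
    """Per-window softmax probability of the picked answer anchor.

    Assumed scoring rule: cosine similarity between each window embedding and
    each anchor embedding, scaled by one parameter `beta`, then a softmax
    over the anchors.
    """
    windows = sliding_windows(cot.split(), size, stride)
    traj = np.stack([embed(w) for w in windows])         # shape (T, d)
    anchor_mat = np.stack([embed(a) for a in anchors])   # shape (A, d)
    sims = traj @ anchor_mat.T                           # cosine, since rows are unit norm
    logits = beta * sims
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, picked]                              # one score per window

if __name__ == "__main__":
    cot = ("the patient has fever and cough so pneumonia is likely "
           "therefore the answer is pneumonia")
    anchors = ["pneumonia", "asthma", "fracture"]
    print(trajectory_confidence(cot, anchors, picked=0))
```

In this sketch the returned array traces how strongly each window points at the picked answer, so a rising sequence would correspond to a trace converging on that anchor. How the per-window scores are aggregated into one confidence value, and how they are fused with the coverage and verbalized-confidence channels, is left open here because the abstract does not specify it.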
