The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

Seonglae Cho; Zekun Wu; Kleyton Da Costa; Adriano Koshiyama

arXiv:2602.08159·cs.LG·February 10, 2026

Open Access

TL;DR

This paper studies the geometry of correctness signals in language models and finds that they are low-dimensional, linearly separable, and encoded in internal activations, so simple geometric measures such as centroid distance are enough to detect whether an answer is correct.

Contribution

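The paper characterizes the geometric structure of correctness representations across 9 models from 5 architecture families, shows that the signal is low-dimensional, linear, and encoded internally rather than in model outputs, and uses that structure for few-shot correctness detection and causal validation through activation steering.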

Findings

01

The discriminative correctness signal occupies 3-8 dimensions, and performance degrades when more dimensions are added.

02

Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), which enables few-shot correctness detection.

03

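Internal probes reach 0.80-0.97 AUC while output-based methods (P(True), semantic entropy) reach only 0.44-0.64 AUC, so the internal correctness signal is not directly expressed in model outputs.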

Abstract

When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), enabling few-shot detection: on GPT-2, 25 labeled examples achieve 89% of full-data accuracy. We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates while random directions show no effect. Internal probes achieve 0.80-0.97 AUC; output-based methods (P(True), semantic entropy) achieve only 0.44-0.64 AUC. The correctness signal exists…
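The centroid-distance idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes PCA as the way to obtain the low-dimensional subspace, uses synthetic Gaussian "activations" in place of real hidden states, and every name in it (`fit_centroid_detector`, `correctness_score`, `make_synthetic`, the choice of `k=5`) is hypothetical.

```python
import numpy as np

def fit_centroid_detector(X, y, k=5):
    """Fit a k-dim PCA subspace and one centroid per class.
    X: (n, d) activations; y: 0/1 labels (1 = correct). Assumption: PCA subspace."""
    mean = X.mean(axis=0)
    # Top-k principal directions from the SVD of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = vt[:k]                      # shape (k, d)
    Z = (X - mean) @ basis.T            # projected training points
    c_correct = Z[y == 1].mean(axis=0)
    c_wrong = Z[y == 0].mean(axis=0)
    return mean, basis, c_correct, c_wrong

def correctness_score(X, detector):
    """Higher score = closer to the 'correct' centroid than to the 'wrong' one."""
    mean, basis, c_correct, c_wrong = detector
    Z = (X - mean) @ basis.T
    d_wrong = np.linalg.norm(Z - c_wrong, axis=1)
    d_correct = np.linalg.norm(Z - c_correct, axis=1)
    return d_wrong - d_correct

def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def make_synthetic(n, d=64, signal_dims=4, shift=1.5, seed=0):
    """Synthetic stand-in for activations: class signal lives in a few dimensions."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    X[:, :signal_dims] += shift * (2 * y[:, None] - 1)
    return X, y

# Few-shot usage: fit on 25 labeled examples, score a held-out set.
X_train, y_train = make_synthetic(25, seed=1)
X_test, y_test = make_synthetic(500, seed=2)
detector = fit_centroid_detector(X_train, y_train, k=5)
print("held-out AUC:", auc(correctness_score(X_test, detector), y_test))
```

The detector has no trained weights beyond the two centroids and the projection basis, which is why a handful of labeled examples can be enough when the signal really is low-dimensional and linear. The numbers this sketch produces on synthetic data say nothing about the AUC values reported in the paper.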

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis