Unsupervised Elicitation of Language Models
Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He, Shi Feng, Ethan Perez, Jan Leike

TL;DR
This paper introduces an unsupervised algorithm called Internal Coherence Maximization (ICM) that fine-tunes pretrained language models using their own generated labels, eliminating the need for external supervision and outperforming human-supervised methods on several tasks.
Contribution
The paper presents ICM, a novel unsupervised fine-tuning method for language models that matches or exceeds performance of supervised training, especially on superhuman capability tasks.
Findings
ICM matches performance of training on golden labels.
ICM outperforms training on crowdsourced labels.
ICM improves training of frontier language models.
Abstract
To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden labels and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use…
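The abstract does not spell out how labels are chosen. As a rough illustration only, the sketch below assumes an objective of the form "weighted mutual predictability minus a count of logical inconsistencies", optimized by a simple annealing-style flip search. The function names, the `alpha` weight, the cooling schedule, and the toy scorer `toy_logprob` (which stands in for a language-model call) are all hypothetical and are not taken from the paper.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Labels = Dict[int, bool]
# Hypothetical stand-in for the LM: log P(label | example i, other labeled examples).
LogProbFn = Callable[[int, bool, Labels], float]


def mutual_predictability(labels: Labels, logprob: LogProbFn) -> float:
    """Sum of log-probs of each label given all the other labeled examples."""
    total = 0.0
    for i, y in labels.items():
        context = {j: v for j, v in labels.items() if j != i}
        total += logprob(i, y, context)
    return total


def inconsistencies(labels: Labels, exclusive_pairs: List[Tuple[int, int]]) -> int:
    """Count pairs that are both labeled True although at most one may be."""
    return sum(1 for a, b in exclusive_pairs if labels[a] and labels[b])


def score(labels: Labels, logprob: LogProbFn,
          exclusive_pairs: List[Tuple[int, int]], alpha: float = 1.0) -> float:
    """Assumed objective: alpha * predictability - number of inconsistencies."""
    return alpha * mutual_predictability(labels, logprob) - inconsistencies(labels, exclusive_pairs)


def search(n: int, logprob: LogProbFn, exclusive_pairs: List[Tuple[int, int]],
           steps: int = 200, t0: float = 1.0, seed: int = 0) -> Tuple[Labels, float]:
    """Annealing-style search: flip one label at a time, keep the best labeling seen."""
    rng = random.Random(seed)
    labels = {i: rng.random() < 0.5 for i in range(n)}
    current = score(labels, logprob, exclusive_pairs)
    best, best_score = dict(labels), current
    for step in range(steps):
        temp = t0 / (1 + step)  # assumed cooling schedule
        i = rng.randrange(n)
        proposal = dict(labels)
        proposal[i] = not proposal[i]
        new = score(proposal, logprob, exclusive_pairs)
        delta = new - current
        # Accept improvements always, worse moves with a temperature-dependent chance.
        if delta > 0 or rng.random() < math.exp(delta / temp):
            labels, current = proposal, new
            if current > best_score:
                best, best_score = dict(labels), current
    return best, best_score


# Toy scorer (NOT a language model): predicts a label from how many context
# examples with the same feature carry that label, with add-one smoothing.
FEATURES = ["a", "a", "b", "b"]


def toy_logprob(i: int, label: bool, context: Labels) -> float:
    same = [v for j, v in context.items() if FEATURES[j] == FEATURES[i]]
    agree = sum(1 for v in same if v == label)
    return math.log((agree + 1) / (len(same) + 2))


EXCLUSIVE = [(0, 2)]  # toy constraint: examples 0 and 2 cannot both be True
coherent = {0: True, 1: True, 2: False, 3: False}
scrambled = {0: True, 1: False, 2: False, 3: True}
all_true = {i: True for i in range(4)}
best, best_score = search(4, toy_logprob, EXCLUSIVE)
```

In this toy setup, the `coherent` labeling scores higher than `scrambled` (its labels predict each other) and higher than `all_true` (which is equally predictable but violates the constraint). The flip search can still settle in a local optimum such as `all_true`, so it is a heuristic rather than a guarantee of the best labeling.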
Peer Reviews
Decision · Submitted to ICLR 2026
+ Clever, self-justifying idea: if the model already “knows” a concept, use that signal instead of noisy human labels. The objective is clean and intuitive.
+ The framework is modular: predictability term + logical consistency + simple search/repair loop.
+ Salience analysis is honest and useful (the method fails when the concept isn’t in the model).
+ Early signs the approach can scale (reward modeling / RL) rather than being just a small-bench trick.
- Scope/generalizability unclear. Most demonstrations look like binary or pairwise decisions (true/false, better/worse). It’s not clear how the objective behaves with non-binary targets. The paper reads a bit specialized to “logical-consistency-style” problems.
- Missing self-rewarding/self-training baselines. For a claim of “unsupervised elicitation,” comparisons to modern self-rewarding / RLAIF-style methods (LM-as-judge or LM-derived rewards), and simple self-training with confidence filters
This paper stands out for its originality and surprisingly strong results. The idea of training LMs without any human labels—using Internal Coherence Maximization to find logically consistent, self-generated labels—is both simple and powerful. The experiments convincingly show that ICM can match or beat human-supervised baselines and even train a Claude 4 assistant competitively. The method feels timely and meaningful as models grow beyond human supervision, and the authors back it up with clear
The main limitation is that ICM’s success depends heavily on how well the underlying model already understands the target concept. When the concept isn’t salient, the method collapses to random guessing. The paper could also do more to explain why mutual predictability works so well—right now it feels more empirical than theoretical. In addition, using closed models like Claude limits reproducibility and makes it hard to verify the claimed parity with human-supervised training.
1. Well-motivated and important problem: studying how to improve language models without human supervision is an important topic in the field, especially for hard tasks such as math and scientific research.
2. Clear idea and simple methodology that works well: the proposed ICM method is conceptually neat and seems easy to implement. Meanwhile the results are pretty strong given this intuitive approach.
3. Broad evaluation: the authors conduct many experiments and ablations to study ICM and show
The "superhuman" framing and claim is problematic given the evaluation method
- Why are GSM8K, TruthfulQA, and Alpaca used as proxies for superhuman supervision tasks? Why not include harder/cleaner/less-contaminated "superhuman" benchmarks (e.g., MATH, GPQA, AIME, etc.) if the central claim is eliciting beyond human-quality capabilities?
- The "superhuman capability" demonstration uses a gender prediction task. This seems more like a pattern-matching task than complex reasoning, and in
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning and Data Classification
