Unsupervised Elicitation of Language Models

Jiaxin Wen; Zachary Ankner; Arushi Somani; Peter Hase; Samuel Marks; Jacob Goldman-Wetzler; Linda Petrini; Henry Sleight; Collin Burns; He He; Shi Feng; Ethan Perez; Jan Leike

arXiv:2506.10139·cs.CL·January 28, 2026

TL;DR

This paper introduces Internal Coherence Maximization (ICM), an unsupervised algorithm that fine-tunes pretrained language models on their own generated labels, without external supervision. On several tasks it matches training on golden labels and outperforms training on crowdsourced human labels.

Contribution

The paper presents ICM, a novel unsupervised fine-tuning method for language models that matches or exceeds the performance of supervised training, especially on tasks where the model's capabilities are superhuman.

Findings

01 ICM matches performance of training on golden labels.
02 ICM outperforms training on crowdsourced labels.
03 ICM improves training of frontier language models.

Abstract

To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden labels and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use…
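The abstract does not spell out the objective, but the reviews on this page describe it as a predictability term, a logical-consistency term, and a simple search/repair loop. The following is a minimal Python sketch of that shape, not the authors' implementation. Everything named here is an assumption made for illustration: `toy_predict_proba` is a similarity-weighted vote standing in for a language model, `must_differ` is a hypothetical consistency constraint (pairs of items whose labels must disagree), and the `alpha` weight, the scoring formula, and the annealing-style acceptance rule are guesses at one reasonable form.

```python
import math
import random


def toy_predict_proba(x, context):
    """Hypothetical stand-in for a language model's P(label=True | x, context).

    Uses a similarity-weighted vote over the other labelled examples.
    """
    weights = [math.exp(-abs(x - cx)) for cx, _ in context]
    positive = sum(w for w, (_, cy) in zip(weights, context) if cy)
    # Add-half smoothing keeps the probability strictly inside (0, 1).
    return (positive + 0.5) / (sum(weights) + 1.0)


def mutual_predictability(labels, features, predict_proba):
    """Sum over i of log P(y_i | x_i, every other labelled example)."""
    total = 0.0
    for i, (x, y) in enumerate(zip(features, labels)):
        context = [(features[j], labels[j]) for j in range(len(labels)) if j != i]
        p_true = predict_proba(x, context)
        total += math.log(p_true if y else 1.0 - p_true)
    return total


def inconsistency(labels, must_differ):
    """Count index pairs that share a label although they must disagree."""
    return sum(1 for i, j in must_differ if labels[i] == labels[j])


def score(labels, features, must_differ, predict_proba=toy_predict_proba, alpha=1.0):
    """Assumed objective: alpha * predictability minus the inconsistency count."""
    return (alpha * mutual_predictability(labels, features, predict_proba)
            - inconsistency(labels, must_differ))


def search(features, must_differ, steps=2000, t0=2.0, seed=0):
    """Annealing-style search: flip one label, then keep or undo the flip."""
    rng = random.Random(seed)
    labels = [rng.random() < 0.5 for _ in features]
    current = score(labels, features, must_differ)
    best_labels, best = list(labels), current
    for step in range(steps):
        temperature = max(t0 / (1 + step), 1e-3)
        i = rng.randrange(len(labels))
        labels[i] = not labels[i]
        candidate = score(labels, features, must_differ)
        # Always accept improvements; accept worse labelings with decaying probability.
        accept = (candidate >= current
                  or rng.random() < math.exp((candidate - current) / temperature))
        if accept:
            current = candidate
            if current > best:
                best_labels, best = list(labels), current
        else:
            labels[i] = not labels[i]  # undo the rejected flip
    return best_labels, best


# Toy data (hypothetical): two clusters of 1-D features, and each item in the
# first cluster is paired with one in the second that must get the other label.
FEATURES = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
MUST_DIFFER = [(0, 3), (1, 4), (2, 5)]

if __name__ == "__main__":
    found, found_score = search(FEATURES, MUST_DIFFER)
    print(found, round(found_score, 3))
```

In this toy setup a uniform labeling (all `True`) is about as predictable as the clustered one, so the consistency penalty is what separates them. That is one way to see why a predictability term alone would not be enough.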

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

+ Clever, self-justifying idea: if the model already “knows” a concept, use that signal instead of noisy human labels. The objective is clean and intuitive.
+ The framework is modular: predictability term + logical consistency + simple search/repair loop.
+ Salience analysis is honest and useful (the method fails when the concept isn’t in the model).
+ Early signs the approach can scale (reward modeling / RL) rather than being just a small-bench trick.

Weaknesses

- Scope/generalizability unclear. Most demonstrations look like binary or pairwise decisions (true/false, better/worse). It’s not clear how the objective behaves with non-binary targets. The paper reads a bit specialized to “logical-consistency-style” problems.
- Missing self-rewarding/self-training baselines. For a claim of “unsupervised elicitation,” comparisons to modern self-rewarding / RLAIF-style methods (LM-as-judge or LM-derived rewards), and simple self-training with confidence filters

Reviewer 02 · Rating 4 · Confidence 3

Strengths

This paper stands out for its originality and surprisingly strong results. The idea of training LMs without any human labels—using Internal Coherence Maximization to find logically consistent, self-generated labels—is both simple and powerful. The experiments convincingly show that ICM can match or beat human-supervised baselines and even train a Claude 4 assistant competitively. The method feels timely and meaningful as models grow beyond human supervision, and the authors back it up with clear

Weaknesses

The main limitation is that ICM’s success depends heavily on how well the underlying model already understands the target concept. When the concept isn’t salient, the method collapses to random guessing. The paper could also do more to explain why mutual predictability works so well—right now it feels more empirical than theoretical. In addition, using closed models like Claude limits reproducibility and makes it hard to verify the claimed parity with human-supervised training.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Well-motivated and important problem: studying how to improve language models without human supervision is an important topic in the field, especially for hard tasks such as math and scientific research.
2. Clear idea and simple methodology that works well: the proposed ICM method is conceptually neat and seems easy to implement. Meanwhile the results are pretty strong given this intuitive approach.
3. Broad evaluation: the authors conduct many experiments and ablations to study ICM and show

Weaknesses

**The "superhuman" framing and claim is problematic given the evaluation method**
- Why are GSM8K, TruthfulQA, and Alpaca used as proxies for superhuman supervision tasks? Why not include harder/cleaner/less-contaminated "superhuman" benchmarks (e.g., MATH, GPQA, AIME, etc.) if the central claim is eliciting beyond human-quality capabilities?
- The "superhuman capability" demonstration uses a gender prediction task. This seems more like a pattern matching task instead of complex reasoning, and in

Code & Models

Repositories

zhaoolee/garss
pytorch

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning and Data Classification