Next Token Perception Score: Analytical Assessment of your LLM Perception Skills

Yu-Ang Cheng; Leyang Hu; Hai Huang; Randall Balestriero

arXiv:2505.17169·cs.CL·May 26, 2025

3 Reviews

TL;DR

This paper introduces the Next Token Perception Score (NTPS), a metric to evaluate how well autoregressive language model representations align with perception tasks, correlating strongly with downstream performance and aiding in fine-tuning assessments.

Contribution

The paper proposes NTPS, a novel analytical metric for measuring the alignment between autoregressive representations and perception tasks, validated across multiple models and datasets.

Findings

1. NTPS correlates strongly with linear probe accuracy.
2. LoRA fine-tuning increases NTPS, improving perception alignment.
3. NTPS predicts gains from LoRA fine-tuning.

Abstract

Autoregressive pretraining has become the de facto paradigm for learning general-purpose representations in large language models (LLMs). However, linear probe performance across downstream perception tasks shows substantial variability, suggesting that features optimized for next-token prediction do not consistently transfer well to downstream perception tasks. We demonstrate that representations learned via autoregression capture features that may lie outside the subspaces most informative for perception. To quantify the (mis)alignment between autoregressive pretraining and downstream perception, we introduce the Next Token Perception Score (NTPS), a score derived under a linear setting that measures the overlap between autoregressive and perception feature subspaces. This metric can be easily computed in closed form from pretrained representations and labeled data, and is proven to…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The proposed metric and its derivation seem pretty clear. It is essentially a subspace alignment score: the Frobenius norm of the part of the perception encoder U that lies inside the next-token subspace spanned by V.
- The metric seems to be well correlated with downstream performance across different models.
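Reviewer 01's description of the score (how much of the perception encoder U lies inside the next-token subspace spanned by V) can be illustrated with a generic projection-based overlap. The sketch below is only an illustration of that geometric idea, not the paper's closed-form NTPS: the function names `orthonormal_basis` and `overlap_fraction`, the normalization by the total squared Frobenius norm of U, and the toy vectors are all assumptions made for this example.

```python
import math


def orthonormal_basis(columns, tol=1e-12):
    """Modified Gram-Schmidt over a list of column vectors.

    Returns an orthonormal list of vectors spanning the same subspace;
    columns that are (numerically) linearly dependent are skipped.
    """
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            coeff = sum(a * b for a, b in zip(q, v))
            v = [a - coeff * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > tol:
            basis.append([a / norm for a in v])
    return basis


def overlap_fraction(U_cols, V_cols):
    """Fraction of the squared Frobenius norm of U that lies in span(V).

    Hypothetical helper: U_cols and V_cols are lists of column vectors.
    The result is 1.0 when every column of U lies in span(V) and 0.0
    when U is orthogonal to span(V).
    """
    Q = orthonormal_basis(V_cols)
    total = sum(a * a for col in U_cols for a in col)
    if total == 0.0:
        raise ValueError("U must contain at least one nonzero entry")
    inside = 0.0
    for col in U_cols:
        for q in Q:
            inside += sum(a * b for a, b in zip(q, col)) ** 2
    return inside / total


# Toy example (made-up vectors): V spans the x-y plane of R^3.
V = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
inside_plane = overlap_fraction([[1.0, 0.0, 0.0]], V)
outside_plane = overlap_fraction([[0.0, 0.0, 1.0]], V)
half_inside = overlap_fraction([[1.0, 0.0, 1.0]], V)
```

In this toy setup, a direction inside the plane scores 1, a direction orthogonal to it scores 0, and a direction split equally between the two scores one half, which is the kind of bounded overlap reading the review alludes to.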

Weaknesses

> Takeaway: Linear probing on pretrained LLM representations can outperform, match, or underperform full-training from scratch.

- Agreed that the linear probing technique is indeed noisy, but Table 1 and this claim seem slightly misleading. These linear probes are used to approximate how good the model is at downstream tasks such as Emotion, so a better study seems to be how well the linear probes correlate with the full fine-tuning performance (when taking different checkpoin

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The work proposes NTPS as a novel metric for measuring the misalignment between perception and next-token prediction objectives, addressing an important gap in understanding of pretrained LLMs’ limited transferability to downstream tasks.
2. The paper includes comprehensive and extensive experimental results: (1) Table 1 demonstrates that linear probing can outperform, match, or underperform full training from scratch, establishing the motivation for the work; (2) Figure 2 shows consistent c

Weaknesses

1. The paper claims that misalignment between perception and autoregressive spaces arises primarily from the next-token prediction loss during pretraining (lines 54-59, Section 3.1). However, other confounding factors could contribute to this phenomenon, including (1) pretraining data size and distribution mismatches with downstream tasks and (2) optimization dynamics and implicit biases.
2. Related to the first point, the paper does not adequately control for or discuss these alternative explanati

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The paper is nicely organized and clearly motivated. The proposed NTPS metric provides an intuitive geometric perspective on alignment and the theoretical section is clear and builds good intuition. The experiments cover a wide range of models and datasets and NTPS shows convincing correlation with MSE loss and additional accuracy gains from LoRA.

Weaknesses

All reported results are based on rank correlations, which makes it hard to interpret what the metric actually means in practice. If I have a model and a downstream task and compute an NTPS score, how should I interpret its magnitude? The paper doesn’t provide guidance on what constitutes a "high" or "low" score, which limits its usefulness. The claim on syntactic vs semantic groupings (as in Fig 1) is nice for intuition but does not seem rigorous based on comparing just the top 2 eigenvalues.
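Reviewer 03's concern about rank correlations can be made concrete with a small sketch. It assumes Spearman's coefficient (the review does not say which rank correlation the paper uses) and entirely made-up score and accuracy values; `ranks` and `spearman` are hypothetical helpers that assume no tied values.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the no-ties formula."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up alignment scores and probe accuracies for five model/task pairs.
scores = [0.10, 0.35, 0.20, 0.80, 0.55]
accuracy = [0.52, 0.61, 0.66, 0.90, 0.74]

# A monotone rescaling changes every magnitude but preserves the ordering.
rescaled = [s ** 3 for s in scores]

rho_original = spearman(scores, accuracy)
rho_rescaled = spearman(rescaled, accuracy)
```

Both coefficients come out identical because only the ordering enters the computation, so a strong rank correlation by itself does not say what a particular score value means in absolute terms.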
