On the Semantic and Syntactic Information Encoded in Proto-Tokens for One-Step Text Reconstruction

Ivan Bondarenko; Egor Palkin; Fedor Tikunov

arXiv:2602.18301·cs.LG·February 23, 2026

Open Access

TL;DR

This paper investigates the semantic and syntactic information encoded in the proto-tokens used for one-step text reconstruction with large language models, and shows how semantic structure can be imposed on them without losing reconstruction accuracy.

Contribution

It provides a detailed analysis of proto-token content, stability, and attention patterns, and introduces regularization schemes to embed semantic structure into proto-tokens.

Findings

01. The m-token captures more semantic information than the e-token.

02. Anchor-based constraints reduce reconstruction accuracy.

03. Relational distillation transfers semantic relations without harming reconstruction.

Abstract

Autoregressive large language models (LLMs) generate text token by token, requiring n forward passes to produce a sequence of length n. Recent work, "Exploring the Latent Capacity of LLMs for One-Step Text Reconstruction" (Mezentsev and Oseledets), shows that frozen LLMs can reconstruct hundreds of tokens from only two learned proto-tokens in a single forward pass, suggesting a path beyond the autoregressive paradigm. In this paper, we study what information these proto-tokens encode and how they behave under reconstruction and controlled constraints. We perform a series of experiments aimed at disentangling semantic and syntactic content in the two proto-tokens, analyzing stability properties of the e-token, and visualizing attention patterns to the e-token during reconstruction. Finally, we test two regularization schemes for "imposing" semantic structure on the e-token using teacher…
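
The findings name relational distillation as the scheme that transfers semantic relations without harming reconstruction. The listing does not give the loss itself, so the following is only a minimal sketch of one common form of relational distillation, under the assumption that the relations being matched are pairwise cosine similarities within a batch. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random


def cosine_similarity_matrix(rows):
    """Pairwise cosine similarities between the vectors in `rows`."""
    unit = []
    for v in rows:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def relational_distillation_loss(student, teacher):
    """Mean squared difference between student and teacher similarity matrices.

    ASSUMPTION: this pairwise-cosine form is an illustration only; the
    paper's actual objective is not specified in this listing.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher batches must have the same size")
    s = cosine_similarity_matrix(student)
    t = cosine_similarity_matrix(teacher)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Toy batch of 4 items; the vectors stand in for teacher sentence embeddings
# and for learned proto-tokens (hypothetical data, not results from the paper).
random.seed(0)
teacher = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
student_aligned = [[3.0 * x for x in v] for v in teacher]  # same directions, rescaled
student_random = [[random.gauss(0, 1) for _ in range(16)] for _ in range(4)]

loss_aligned = relational_distillation_loss(student_aligned, teacher)
loss_random = relational_distillation_loss(student_random, teacher)
```

Because only the batch-level similarity structure is compared, the student and teacher vectors may have different dimensions, and the student is not pinned to fixed target points. That is one plausible reading of why a relational constraint could interfere less with reconstruction than an anchor-based one, though the listing itself does not state the mechanism.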

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms