TIDE: Every Layer Knows the Token Beneath the Context

Ajay Jaiswal; Lauren Hannah; Han-Byul Kim; Duc Hoang; Mehrdad Farajtabar; Minsik Cho

arXiv:2605.06216 · cs.CL · May 8, 2026

TL;DR

TIDE augments transformer-based language models with a token-specific memory that is injected at every layer, targeting the rare-token and contextual-collapse problems and improving performance.

Contribution

The paper introduces EmbeddingMemory, a novel memory mechanism that preserves token identities throughout the model, improving upon the standard single-injection approach.

Findings

1. TIDE reduces the rare token problem in language models.

2. TIDE improves model performance on multiple language tasks.

3. EmbeddingMemory enhances token distinction across layers.

Abstract

We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where the Zipf-type distribution of the vocabulary leaves rare-token embeddings chronically under-trained, because they receive only a fraction of the cumulative gradient signal that common tokens receive; and (ii) the Contextual Collapse Problem, where models with limited parameters map distributionally similar tokens to indistinguishable hidden states. As an attempt to address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned…
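The data flow described in the abstract can be illustrated with a minimal, pure-Python sketch. Only the names EmbeddingMemory and MemoryBlock come from the abstract. The abstract is cut off before it describes the depth-conditioned mechanism, so the per-layer scalar gate (`layer_gates`), the averaging over the K blocks, and every other name and parameter here are placeholder assumptions chosen for illustration, not the authors' method.

```python
import random


class MemoryBlock:
    """One context-free lookup table: token index -> vector of length dim."""

    def __init__(self, vocab_size, dim, rng):
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)
        ]

    def lookup(self, token_ids):
        return [self.table[t] for t in token_ids]


class EmbeddingMemory:
    """Ensemble of K independent MemoryBlocks (combined by averaging: an assumption)."""

    def __init__(self, vocab_size, dim, num_blocks, num_layers, seed=0):
        rng = random.Random(seed)
        self.blocks = [MemoryBlock(vocab_size, dim, rng) for _ in range(num_blocks)]
        # ASSUMPTION: depth conditioning is modeled as one scalar gate per layer.
        # The source text is truncated before it specifies the real mechanism.
        self.layer_gates = [1.0 / (layer + 1) for layer in range(num_layers)]

    def compute(self, token_ids):
        """Build the memory vectors once per sequence from token indices only."""
        per_block = [block.lookup(token_ids) for block in self.blocks]
        k = len(self.blocks)
        return [
            [sum(vals) / k for vals in zip(*(pb[i] for pb in per_block))]
            for i in range(len(token_ids))
        ]

    def inject(self, hidden, memory, layer):
        """Add the gated memory vectors to the hidden states of one layer."""
        gate = self.layer_gates[layer]
        return [
            [h + gate * m for h, m in zip(h_row, m_row)]
            for h_row, m_row in zip(hidden, memory)
        ]


# Usage: compute the memory once, then reuse it at every layer.
num_layers = 2
mem = EmbeddingMemory(vocab_size=10, dim=4, num_blocks=3, num_layers=num_layers)
token_ids = [1, 7, 1]
memory = mem.compute(token_ids)
hidden = [[0.0] * 4 for _ in token_ids]
for layer in range(num_layers):
    # A real transformer layer (attention, MLP) would update `hidden` here.
    hidden = mem.inject(hidden, memory, layer)
```

Because the memory depends only on the token index, positions 0 and 2 (both token 1) receive identical memory vectors regardless of context, which is the "context-free" property the abstract names. Computing `memory` once and reusing it in the loop mirrors the "computed once and injected into every layer" description.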
