TIDE: Every Layer Knows the Token Beneath the Context
Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Mehrdad Farajtabar, Minsik Cho

TL;DR
TIDE augments transformer language models with a token-indexed memory that is injected into every layer rather than looked up once at the input, targeting the Rare Token and Contextual Collapse problems, with reported performance gains.
Contribution
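The paper introduces EmbeddingMemory, an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors and inject them into every layer, replacing the standard single-injection approach in which the token index is discarded after the input embedding lookup.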
Findings
TIDE mitigates the Rare Token Problem in language models.
TIDE improves model performance on multiple language tasks.
EmbeddingMemory enhances token distinction across layers.
Abstract
We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type vocabulary distribution causes rare-token embeddings to be chronically under-trained, because they receive only a fraction of the cumulative gradient signal that common tokens receive; and (ii) the Contextual Collapse Problem, where parameter-limited models map distributionally similar tokens to indistinguishable hidden states. As an attempt to address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned…
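To make the Rare Token Problem concrete, the following is a minimal sketch, not taken from the paper, of how unevenly token occurrences are spread under an idealized Zipf law. It assumes that a token's share of occurrences is a rough proxy for the share of gradient updates its input embedding row receives. The vocabulary size of 50,000, the exponent of 1.0, and the function name `zipf_shares` are illustrative assumptions.

```python
def zipf_shares(vocab_size, exponent=1.0):
    """Return the occurrence share of each token rank under an idealized Zipf law.

    Assumption (not from the paper): frequency of rank r is proportional
    to 1 / r**exponent, and occurrence share approximates the share of
    gradient updates that reach the token's input embedding row.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical vocabulary size chosen only for illustration.
shares = zipf_shares(50_000)

# Share of all occurrences taken by the 100 most frequent tokens.
head_share = sum(shares[:100])

# Share of all occurrences taken by the rarest half of the vocabulary.
tail_share = sum(shares[-25_000:])

# How many times more often the most frequent token occurs than the rarest.
ratio = shares[0] / shares[-1]

print(f"top-100 share:     {head_share:.3f}")
print(f"rarest-half share: {tail_share:.3f}")
print(f"most/least ratio:  {ratio:.0f}")
```

Under these assumptions, the 100 most frequent tokens account for a far larger share of occurrences than the rarest half of the vocabulary combined, which is the imbalance the abstract describes as chronic under-training of rare-token embeddings.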
