TL;DR
X-GRAM is a frequency-aware dynamic token-injection framework for large token-indexed lookup tables. It targets the poor parameter efficiency and rapid memory growth of such tables, offering a memory-centric scaling axis that decouples model capacity from FLOPs.
Contribution
The paper proposes a frequency-aware dynamic token-injection framework that combines hybrid hashing with local n-gram feature extraction to improve parameter efficiency and scalability.
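To make the idea of frequency-aware hybrid hashing concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the class name `HybridTable`, the seeded BLAKE2b hash, and the uniform averaging of alias slots are all assumptions chosen for illustration. The only ideas taken from the summary are that frequent ("head") tokens keep their own capacity while rare ("tail") tokens are compressed into a shared pool.

```python
import hashlib


def _hash(token_id: int, seed: int, buckets: int) -> int:
    """Deterministic seeded hash of a token id into [0, buckets)."""
    digest = hashlib.blake2b(f"{seed}:{token_id}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % buckets


class HybridTable:
    """Hypothetical lookup table: dedicated rows for head tokens, hashed rows for the tail."""

    def __init__(self, head_ids, tail_buckets, dim, num_aliases=2):
        # Head tokens (assumed to be the most frequent ones) each get a private row.
        self.head_slot = {tok: i for i, tok in enumerate(head_ids)}
        self.tail_buckets = tail_buckets
        self.num_aliases = num_aliases
        self.dim = dim
        # Placeholder values so the example is deterministic; a real table would be trained.
        n_rows = len(self.head_slot) + tail_buckets
        self.rows = [[float(r + 1)] * dim for r in range(n_rows)]

    def slots(self, token_id):
        """Return the row indices that back this token."""
        if token_id in self.head_slot:
            return [self.head_slot[token_id]]
        offset = len(self.head_slot)  # tail rows live after the head rows
        return [offset + _hash(token_id, seed, self.tail_buckets)
                for seed in range(self.num_aliases)]

    def lookup(self, token_id):
        """Average the backing rows (assumed mixing rule, uniform weights)."""
        idx = self.slots(token_id)
        return [sum(self.rows[i][d] for i in idx) / len(idx) for d in range(self.dim)]


table = HybridTable(head_ids=[10, 11, 12], tail_buckets=4, dim=2)
head_vec = table.lookup(11)      # served from a private row
tail_vec = table.lookup(99_999)  # served from a mix of shared, hashed rows
```

With this layout the table needs `len(head_ids) + tail_buckets` rows regardless of how many distinct tail tokens exist, which is the memory trade-off the hashing is meant to buy. Reading a tail token from several hashed rows makes it less likely that two tail tokens end up with identical vectors than with a single hash.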
Findings
X-GRAM improves accuracy by up to 4.4 points over baseline models.
It reduces memory usage by 50% while maintaining performance.
Extensive evaluations demonstrate superior scalability at 0.73B and 1.15B scales.
Abstract
Large token-indexed lookup tables provide a compute-decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under-training of the long tail, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. To address this, we propose X-GRAM, a frequency-aware dynamic token-injection framework. X-GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n-gram features. These signals are integrated into attention value streams and inter-layer residuals using depth-aware gating, effectively aligning static memory with dynamic context. This design introduces a memory-centric scaling axis that decouples model capacity from FLOPs.…
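The refinement and injection steps described in the abstract can be illustrated with a simplified sketch. This is an assumption-laden toy, not the authors' architecture: it uses a plain causal depthwise convolution in place of the normalized SwiGLU ShortConv, and a single hypothetical scalar gate per layer in place of whatever depth-aware gating the paper defines. The function names `causal_short_conv` and `gated_inject` and the values in `layer_gate_logits` are invented for illustration.

```python
import math


def causal_short_conv(seq, kernel):
    """Depthwise causal convolution: output at step t mixes only steps t-k+1..t."""
    k = len(kernel)
    dim = len(seq[0])
    out = []
    for t in range(len(seq)):
        acc = [0.0] * dim
        for j, w in enumerate(kernel):
            src = t - (k - 1 - j)  # kernel[-1] weights the current position
            if src >= 0:           # positions before the sequence start contribute nothing
                for d in range(dim):
                    acc[d] += w * seq[src][d]
        out.append(acc)
    return out


def gated_inject(hidden, memory, gate_logit):
    """Residual injection: hidden + sigmoid(gate_logit) * memory, per position."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [[h + g * m for h, m in zip(h_row, m_row)]
            for h_row, m_row in zip(hidden, memory)]


# Retrieved vectors for a 3-token sequence (dim = 1 for readability).
retrieved = [[1.0], [2.0], [3.0]]
local = causal_short_conv(retrieved, kernel=[0.5, 0.5])  # bigram-style mixing

# Hypothetical per-layer gate logits: a different injection strength at each depth.
layer_gate_logits = [0.0, -50.0]
hidden = [[1.0], [1.0], [1.0]]
per_layer = [gated_inject(hidden, local, logit) for logit in layer_gate_logits]
```

The convolution gives each position a feature that depends on a short window of preceding tokens, which is one simple way to obtain local n-gram signals from per-token lookups. The per-layer gate lets each depth decide how much of the static memory to admit, so a layer with a strongly negative logit effectively ignores the injected signal.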
