Leviathan: Decoupling Input and Output Representations in Language Models
Reza T. Batley, Sourav Saha

TL;DR
Leviathan is a Transformer architecture that decouples input and output representations, improving language modeling performance, especially on rare tokens, with minimal additional parameters.
Contribution
The paper presents Leviathan, which replaces the input embedding matrix with learned embedding vectorization (LEV) and improves on standard tied-embedding baselines for a parameter increase as low as 0.2%.
Findings
Leviathan reduces validation perplexity by 9% at the 1.2B-parameter scale.
It requires 2.1 times fewer training tokens to reach the tied baseline's final loss.
It achieves a 30% reduction in LAMBADA perplexity.
Abstract
Modern language models use a single matrix for input embedding and output projection. This couples two distinct objectives: token representation and discrimination over a vocabulary. This work introduces Leviathan, a Transformer architecture that replaces the input embedding matrix with learned embedding vectorization (LEV), a compact continuous mapping from token indices to embeddings. Leviathan's output head remains untied for a parameter increase of as low as 0.2%. Under controlled comparisons with identical Transformer backbones, Leviathan consistently improves language modeling performance over standard tied-embedding baselines across a 200M-1.2B parameter regime on The Pile with gains that grow during training. At 1.2B scale, Leviathan reduces validation perplexity by 9%, requires fewer training tokens to reach the tied baseline's final loss, and improves on all six…
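The decoupling described in the abstract can be illustrated with a minimal sketch. This summary does not say how LEV is parameterized, so everything below is an assumption made for illustration: the base-64 digit coordinates, the two-layer tanh MLP, the class name CompactInputMap, and the sizes chosen are all hypothetical and are not the authors' design. The sketch only shows the general shape of the idea: input embeddings come from a small continuous function of the token index, while the output head stays a separate vocabulary-sized matrix.

```python
import numpy as np


def token_coordinates(token_ids, vocab_size, base=64):
    """Map integer token ids to continuous coordinates in [0, 1].

    Assumption: each id is written as base-`base` digits. The coordinate
    scheme used in the paper is not given in this summary.
    """
    token_ids = np.asarray(token_ids)
    n_digits = 1
    while base ** n_digits < vocab_size:
        n_digits += 1
    digits = [(token_ids // base ** k) % base for k in range(n_digits)]
    return np.stack(digits, axis=-1) / (base - 1)


class CompactInputMap:
    """Hypothetical stand-in for LEV: a small MLP from coordinates to embeddings."""

    def __init__(self, vocab_size, d_model, hidden=256, base=64, seed=0):
        rng = np.random.default_rng(seed)
        self.vocab_size, self.base = vocab_size, base
        n_coords = token_coordinates([0], vocab_size, base).shape[-1]
        self.w1 = rng.normal(0.0, 0.02, (n_coords, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.02, (hidden, d_model))

    def __call__(self, token_ids):
        x = token_coordinates(token_ids, self.vocab_size, self.base)
        return np.tanh(x @ self.w1 + self.b1) @ self.w2

    def num_params(self):
        return self.w1.size + self.b1.size + self.w2.size


# Illustrative sizes only; not taken from the paper.
vocab_size, d_model = 50_000, 512
input_map = CompactInputMap(vocab_size, d_model)

# The output head is its own matrix and shares nothing with the input map.
output_head = np.zeros((vocab_size, d_model))

emb = input_map([0, 17, 49_999])  # one embedding per token id
logits = emb @ output_head.T      # scores over the whole vocabulary
```

In this toy setup, a tied baseline would hold one vocab_size × d_model matrix serving both roles. The sketch keeps a matrix of that size for the output head and adds only the parameters of the small input map, which is one way to see how the overall parameter increase can stay small.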
