SpeLLM: Character-Level Multi-Head Decoding
Amit Ben-Artzy, Roy Schwartz

TL;DR
SpeLLM introduces a character-level multi-head decoding method that decouples input and output vocabularies, enabling larger output spaces and reducing runtime costs in large language models.
Contribution
The paper proposes a multi-head decoding approach in which each output head predicts one character independently, allowing a larger output vocabulary without increasing model size, and shows that standard LLMs can be converted to this architecture through self-distillation.
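To make the size argument concrete, here is a rough back-of-the-envelope sketch comparing one vocabulary-sized output projection with several small character-level heads. The function names and all numbers (hidden size, vocabulary size, number of heads, alphabet size) are illustrative assumptions, not values taken from the paper.

```python
def standard_head_params(hidden_dim: int, vocab_size: int) -> int:
    """Weights in one output projection: grows linearly with vocabulary size."""
    return hidden_dim * vocab_size


def char_heads_params(hidden_dim: int, num_heads: int, alphabet_size: int) -> int:
    """Weights in num_heads independent linear heads, each over a character alphabet."""
    return num_heads * hidden_dim * alphabet_size


def char_heads_output_space(num_heads: int, alphabet_size: int) -> int:
    """Upper bound on distinct strings the heads can jointly express."""
    return alphabet_size ** num_heads


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    hidden_dim, vocab_size = 4096, 128_000
    num_heads, alphabet_size = 10, 256

    print("standard projection:", standard_head_params(hidden_dim, vocab_size))
    print("character heads:    ", char_heads_params(hidden_dim, num_heads, alphabet_size))
    print("expressible strings:", char_heads_output_space(num_heads, alphabet_size))
```

Under these assumed numbers the character heads use far fewer weights than the single projection while covering a much larger space of output strings, which is the trade-off the contribution describes.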
Findings
Achieves 5.1% average runtime reduction across models
Maintains competitive performance on downstream tasks
Enables support for underrepresented languages and domains
Abstract
Scaling LLM vocabulary is often used to reduce input sequence length and alleviate attention's quadratic cost. Yet, current LLM architectures impose a critical bottleneck to this procedure: the output projection layer scales linearly with vocabulary size, rendering substantial expansion impractical. We propose SpeLLM, a method that decouples input and output vocabularies by predicting character-level strings through multiple output heads. In SpeLLM, each of the linear heads predicts a single character simultaneously, enabling the model to represent a much larger output space using smaller, independent linear heads. We present a self-distillation approach for converting a standard LLM to a SpeLLM. Our experiments with four pre-trained LLMs show their SpeLLM variants achieve competitive performance on downstream tasks while reducing runtime by 5.1% on average across models. Our…
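The decoding step described in the abstract can be sketched roughly as follows: one hidden state is fed to several small linear heads, each head picks one character, and the characters are joined into the output string. This is a minimal pure-Python illustration, not the authors' implementation. The `PAD` marker, the greedy argmax choice, and the toy weights are assumptions introduced here; the source does not say how shorter strings or invalid character combinations are handled.

```python
from typing import List, Sequence

PAD = "\0"  # hypothetical marker meaning "no character at this position"


def linear(hidden: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """Apply one linear head: one weight row (and one logit) per alphabet symbol."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weights]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def decode_token(
    hidden: Sequence[float],
    heads: Sequence[Sequence[Sequence[float]]],
    alphabet: Sequence[str],
) -> str:
    """Greedy decode: head i predicts the character at position i.

    Every head reads the same hidden state and does not depend on the
    other heads, so in a real model all heads could run in parallel.
    """
    chars = []
    for weights in heads:
        symbol = alphabet[argmax(linear(hidden, weights))]
        if symbol == PAD:  # assumed convention: PAD ends the string
            break
        chars.append(symbol)
    return "".join(chars)


if __name__ == "__main__":
    alphabet = ["a", "b", PAD]
    hidden = [1.0, 0.0]
    heads = [
        [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]],  # head 0 favors "b"
        [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # head 1 favors "a"
        [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]],  # head 2 favors PAD
    ]
    print(decode_token(hidden, heads, alphabet))
```

The loop visits heads in order only to assemble the string; no head consumes another head's output, which matches the abstract's description of independent heads predicting characters simultaneously.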
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
