FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference
Wilhelm Tranheden, Shahnawaz Ahmed, Devdatt Dubhashi, Jonna Matthiesen, Hannes von Essen

TL;DR
FlashHead is a training-free, hardware-efficient replacement for language model classification heads that speeds up inference by up to 1.75x while keeping accuracy comparable to the original head, which makes small models more practical on consumer devices.
Contribution
FlashHead replaces the dense classification head with a retrieval-based one that combines balanced clustering of the vocabulary, multi-probe retrieval, probabilistic sampling, and quantization.
Findings
Achieves up to 1.75x inference speedup on multiple models.
Maintains output accuracy comparable to original heads.
Establishes new benchmarks for efficient language model inference.
Abstract
Language models are increasingly adopting smaller architectures optimized for consumer devices. In this setting, inference efficiency is the primary constraint. Meanwhile, vocabulary sizes continue to grow rapidly, making the classification head a critical bottleneck that accounts for up to 60% of model parameters and 50% of inference compute. We introduce FlashHead, the first efficient drop-in replacement for the dense classification head that is training-free and hardware-friendly. FlashHead builds on principles from information retrieval, reframing the computation at the output head as a retrieval problem rather than a dense classification over the full vocabulary. FlashHead introduces four key innovations: (1) a balanced clustering scheme that structures vocabulary partitions into compact hardware-efficient tensors, (2) an extension of multi-probe retrieval to language model heads,…
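The retrieval framing in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `two_stage_head`, the toy sizes, and the random equal-size partition standing in for the balanced clustering step are all assumptions made for illustration. Probabilistic sampling and quantization are left out.

```python
import numpy as np

def two_stage_head(h, W, centroids, cluster_to_token, p):
    """Approximate the argmax token by scoring only p probed clusters.

    h: (d,) hidden state; W: (v, d) head weights;
    centroids: (c, d); cluster_to_token: (c, v // c) integer token ids.
    All names are hypothetical; this is an illustrative sketch only.
    """
    # Stage 1: score every centroid and keep the p highest-scoring clusters.
    cluster_scores = centroids @ h                          # (c,)
    probed = np.argpartition(-cluster_scores, p - 1)[:p]    # (p,)
    # Stage 2: exact logits only for the tokens inside the probed clusters.
    # Equal cluster sizes make this gather a dense (p, v // c) block.
    candidates = cluster_to_token[probed].reshape(-1)       # (p * v // c,)
    logits = W[candidates] @ h
    return int(candidates[np.argmax(logits)])

rng = np.random.default_rng(0)
v, d, c, p = 1024, 32, 64, 8          # toy sizes, chosen so that c divides v
W = rng.standard_normal((v, d))
# Assumption: a random equal-size partition replaces the real clustering step.
cluster_to_token = rng.permutation(v).reshape(c, v // c)
centroids = W[cluster_to_token].mean(axis=1)                # (c, d)
h = rng.standard_normal(d)
token = two_stage_head(h, W, centroids, cluster_to_token, p)
```

Setting `p` equal to `c` probes every cluster, so the result matches the dense head exactly; smaller values of `p` score fewer tokens at the risk of missing the true top token.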
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1. Originality. Equal-size spherical clustering for the head plus aggressive multi-probe and probabilistic sampling is novel and tailored to accelerators.
2. Strong evidence: near-perfect top-$k$ containment and consistent model-level speedups; int4 stage-1 retains accuracy.
3. Method and complexity are explicit; pseudocode and ablations make design choices convincing.
4. Significance. Reduces a major bottleneck for SLMs; drop-in and training-free lowers adoption barriers.
1. Likelihoods & evaluation. No closed-form full-vocabulary distribution; relies on Monte-Carlo for likelihood metrics.
2. Equal-size constraint. Requires $c\mid v$; effect on semantic purity of clusters vs. unequal sizes could be further theorized.
3. Deployment knobs. Sensitivity of $(c,p)$ and memory overheads (centroids, C2T map) under tight device budgets could be quantified more deeply.
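To make the $(c,p)$ knobs and the $c\mid v$ constraint named above concrete, here is a rough back-of-the-envelope counter. It assumes a two-stage head that scores $c$ centroids and then $p$ clusters of $v/c$ tokens each, and it counts only multiply-accumulates and stored entries. The function name `head_costs` and the example sizes are hypothetical, and the figures are not measurements from the paper.

```python
def head_costs(v, d, c, p):
    """Rough operation and storage counts for a two-stage head (illustrative).

    v: vocabulary size, d: hidden size, c: number of clusters, p: probes.
    Ignores gather overhead, sampling, and quantization.
    """
    if v % c != 0:
        raise ValueError("equal-size clusters require c to divide v")
    cluster_size = v // c
    dense_macs = v * d                               # full dense head
    two_stage_macs = c * d + p * cluster_size * d    # centroids + probed tokens
    return {
        "cluster_size": cluster_size,
        "dense_macs": dense_macs,
        "two_stage_macs": two_stage_macs,
        "centroid_params": c * d,   # extra weights for the centroid matrix
        "map_entries": v,           # one token id per slot of the (c, v // c) map
    }

costs = head_costs(v=1024, d=32, c=64, p=8)
```

Under these assumptions the multiply-accumulate count falls whenever $c + p\,v/c < v$, while the centroid matrix and the cluster-to-token map are the added memory that a tight device budget has to absorb.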
- Clear algorithmic description and implementation details. Hyperparameters are fully specified and the two stage process is well explained with neat figures.
- The paper includes diverse benchmarks with different model families along with latency measurements showing practical speedups.
- A fast training-free approach with near perfect top-1 containment.
- Equal sized clustering helps efficient memory access by avoiding ragged tensor operations.
**Literature survey is generally lacking**
- Sample relevant works not discussed.
  - HALOS: Hashing Large Output Space for Cheap Inference ([paper](https://proceedings.mlsys.org/paper_files/paper/2022/file/b059dd6da6b9a86180fbc32a799766cc-Paper.pdf))
  - HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference ([paper](https://arxiv.org/abs/2402.09360))
  - VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits ([paper](https://arxiv.
(S1) Timely and important topic. While most papers focus on the efficiency of large language models (LLMs), improving the efficiency of SLMs can help democratize the use of powerful AI models.
(S2) Sound empirical results. The results show that FlashHead gives up to a 1.75x speedup, while the impact on model accuracy is minor.
(W1) Limited technical contributions. The 4 key innovations are built upon existing literature, and the paper does not sufficiently clarify what new insights they introduce.
- Equal-sized Clustering: As mentioned in this paper, using K-means to make the vocab compact has been proposed in existing literature, and this work differs only in requiring all clusters to be equal-sized. For me, it is primarily an implementation-level optimization. The paper presents it as a key innov
The method effectively improves the efficiency of the LM head computation, which constitutes a significant portion of inference cost in small and medium-sized language models (SLMs), without any additional training cost. By enforcing equal-sized clustering within the LM head, the approach achieves a hardware-friendly design that balances GPU workloads and minimizes accuracy degradation. The proposed procedure — spherical k-means combined with multi-probe retrieval — is conceptually simple, eas
The experiments are somewhat fragmented across different axes, making it difficult to assess the overall advantage of FlashHead in terms of cost, performance, and latency. A unified comparison table against retraining-based methods would strengthen the paper’s quantitative clarity. It is unclear whether the comparisons were made under equivalent experimental conditions. Since hyperparameters such as clustering size and the number of probes can significantly affect performance, the paper should
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Topic Modeling · Explainable Artificial Intelligence (XAI)
