RadixMLP -- Intra-batch Deduplication for Causal Transformers

Michael Feil; Julius Lipp

arXiv:2601.15013·cs.LG·January 22, 2026

TL;DR

RadixMLP reduces redundant computation in causal transformer inference by deduplicating prefixes shared across the sequences of a batch, with reported speedups of 1.44-1.59× on real workloads such as reranking.

Contribution

RadixMLP maps each batch to a prefix trie so that position-wise computations (MLPs, LayerNorms, linear projections, and embeddings) run once per shared prefix token during inference rather than once per sequence.

Findings

01  Achieves 1.44-1.59× speedup on real workloads

02  Up to 5× speedup on synthetic benchmarks with long shared prefixes

03  Operates within a single forward pass without statefulness

Abstract

Batch inference workloads for causal transformer models frequently process sequences that share common prefixes, such as system prompts, few-shot examples, or shared queries. Standard inference engines treat each sequence independently, redundantly recomputing identical MLP activations for every copy of the shared prefix. We introduce RadixMLP, a technique that exploits the position-wise nature of MLPs, LayerNorms, linear projections, and embeddings to eliminate this redundancy. RadixMLP dynamically maps batches to a prefix trie, gathering shared segments into a compressed representation for position-wise computation and scattering results back only at attention boundaries. RadixMLP is stateless and operates within a single forward pass. In end-to-end serving benchmarks on MS MARCO v1.1 with Qwen3 models (0.6B to 8B parameters), RadixMLP achieves 1.44-1.59× speedups in realistic…
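The gather/scatter idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the helper names (build_prefix_index, embed, mlp, causal_mix), the toy embedding, the toy MLP, and the running-mean stand-in for causal attention are all assumptions chosen so the example runs with the Python standard library alone. It shows that tokens mapped to the same trie node receive identical hidden states under a causal mixer, so the position-wise step can run once per node and be scattered back.

```python
import math

def build_prefix_index(batch):
    """Map every (sequence, position) to a prefix-trie node id.

    Two tokens share a node exactly when their sequences agree on every
    token up to and including that position.
    """
    children = {}        # (parent_node_id, token_id) -> node_id
    node_tokens = []     # token id of each node: the compact batch
    gather_index = []    # first (sequence, position) that reaches each node
    scatter_index = []   # per sequence: node id at every position
    for s, seq in enumerate(batch):
        parent, ids = -1, []          # -1 stands for the virtual root
        for pos, tok in enumerate(seq):
            key = (parent, tok)
            if key not in children:
                children[key] = len(node_tokens)
                node_tokens.append(tok)
                gather_index.append((s, pos))
            parent = children[key]
            ids.append(parent)
        scatter_index.append(ids)
    return node_tokens, gather_index, scatter_index

def embed(token_id, dim=4):
    # Toy position-wise embedding (hypothetical, for illustration only).
    return [math.sin(token_id * (k + 1)) for k in range(dim)]

def mlp(x):
    # Toy position-wise MLP: the output depends only on the input vector.
    return [v + max(0.0, 2.0 * v - 0.5) for v in x]

def causal_mix(vectors):
    # Stand-in for causal attention: running mean over the prefix.
    if not vectors:
        return []
    out, running = [], [0.0] * len(vectors[0])
    for i, v in enumerate(vectors, start=1):
        running = [r + x for r, x in zip(running, v)]
        out.append([r / i for r in running])
    return out

def forward_naive(batch):
    # Every sequence is processed independently, prefix included.
    outputs, mlp_calls = [], 0
    for seq in batch:
        hidden = causal_mix([embed(t) for t in seq])
        outputs.append([mlp(h) for h in hidden])
        mlp_calls += len(seq)
    return outputs, mlp_calls

def forward_dedup(batch):
    node_tokens, gather_index, scatter_index = build_prefix_index(batch)
    # Position-wise step on the compact layout: one embedding per trie node.
    compact_emb = [embed(t) for t in node_tokens]
    # Scatter to the per-sequence layout for the causal mixing step.
    hidden = [causal_mix([compact_emb[n] for n in ids]) for ids in scatter_index]
    # Gather back: any occurrence of a node carries the same hidden state.
    compact_hidden = [hidden[s][pos] for s, pos in gather_index]
    compact_out = [mlp(h) for h in compact_hidden]
    outputs = [[compact_out[n] for n in ids] for ids in scatter_index]
    return outputs, len(compact_out)

# Three sequences sharing the four-token prefix [1, 2, 3, 4].
batch = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 6, 7], [1, 2, 3, 4, 8]]
naive_out, naive_calls = forward_naive(batch)
dedup_out, dedup_calls = forward_dedup(batch)
assert naive_out == dedup_out
print(naive_calls, dedup_calls)  # prints "16 8"
```

In this toy batch the position-wise step runs on 8 trie nodes instead of 16 tokens. The ratio of total tokens to trie nodes grows with the length of the shared prefix, which is consistent with the larger gains reported for long shared prefixes, though the sketch says nothing about actual wall-clock speedups.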

Taxonomy

Topics: Machine Learning in Materials Science · Parallel Computing and Optimization Techniques · Scientific Computing and Data Management