Unlocking Full Efficiency of Token Filtering in Large Language Model Training
Di Chai, Pengbo Li, Feiyuan Zhang, Yilun Jin, Han Tian, Kaiqiang Xu, Binhang Yuan, Dian Shen, Junxue Zhang, Kai Chen

TL;DR
Centrifuge is a system that significantly accelerates large language model training by optimizing token filtering efficiency through algorithm and system co-design, reducing training time while maintaining model utility.
Contribution
It introduces a novel co-designed approach that enhances token filtering efficiency in LLM training, enabling real-world speedups with minimal code changes.
Findings
Reduces backpropagation time by up to 49.9%
Decreases end-to-end training time by up to 34.7%
Maintains or improves model utility and performance
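As a rough consistency check on the two timing figures above, the short calculation below assumes a simple model in which only the backward pass gets faster and every other part of a training step is unchanged. That model, the function name, and the idea that both "up to" figures come from the same configuration are assumptions made here for illustration; the summary does not state them.

```python
def implied_backward_share(end_to_end_saving: float, backward_saving: float) -> float:
    """Share of step time the backward pass would need to occupy.

    Assumes (hypothetically) that end_to_end_saving = share * backward_saving,
    i.e. only the backward pass is accelerated.
    """
    return end_to_end_saving / backward_saving


# Figures taken from the Findings list: 34.7% end-to-end, 49.9% backpropagation.
share = implied_backward_share(0.347, 0.499)
print(f"implied backward share of step time: {share:.3f}")
```

Under that assumption, the backward pass would account for roughly 70% of the step time in the best-case configuration.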
Abstract
Token filtering has been proposed to enhance the utility of large language models (LLMs) by eliminating inconsequential tokens during training. While using fewer tokens is expected to reduce computational workloads, existing methods have not yet achieved a real-world efficiency boost. This is primarily due to two factors: (1) existing work has inadequate sparsity for speedup, and (2) token filtering operates within a sparsity range that is non-standard in existing machine learning (ML) libraries and thus cannot be efficiently supported. This paper presents Centrifuge, a system that leverages algorithm and system co-design to unleash the full efficiency of token filtering in LLM training. At the algorithm level, Centrifuge filters activations of inconsequential tokens in the attention backward kernel to amplify the sparsity in backward computation. At the system level, Centrifuge proposes…
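To make the second factor concrete, the sketch below shows, for a single linear layer only, why a token-level mask can be turned into a smaller dense computation: the weight gradient computed with zeroed rows for filtered tokens equals the one computed after gathering only the kept tokens. This is not Centrifuge's kernel or its graph transformation. The function names, shapes, and the `keep` mask are hypothetical, and attention layers are more involved because tokens interact with each other.

```python
import random


def matmul_t(x, g):
    """Return x^T @ g for row-major lists: x is (n, d_in), g is (n, d_out)."""
    d_in, d_out = len(x[0]), len(g[0])
    out = [[0.0] * d_out for _ in range(d_in)]
    for x_row, g_row in zip(x, g):
        for i, x_val in enumerate(x_row):
            for j, g_val in enumerate(g_row):
                out[i][j] += x_val * g_val
    return out


def weight_grad_masked(x, grad_y, keep):
    """Sparse view: zero the gradient rows of filtered tokens, keep the full shape."""
    masked = [g_row if k else [0.0] * len(g_row) for g_row, k in zip(grad_y, keep)]
    return matmul_t(x, masked)


def weight_grad_gathered(x, grad_y, keep):
    """Dense view: gather only the kept tokens, then run a smaller dense product."""
    idx = [t for t, k in enumerate(keep) if k]
    return matmul_t([x[t] for t in idx], [grad_y[t] for t in idx])


random.seed(0)
n_tokens, d_in, d_out = 8, 4, 3
x = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(n_tokens)]
grad_y = [[random.uniform(-1, 1) for _ in range(d_out)] for _ in range(n_tokens)]
# Hypothetical filtering decision: 3 of 8 tokens are kept.
keep = [True, False, False, True, False, True, False, False]

full = weight_grad_masked(x, grad_y, keep)
small = weight_grad_gathered(x, grad_y, keep)
max_diff = max(abs(a - b) for ra, rb in zip(full, small) for a, b in zip(ra, rb))
print("max difference between masked and gathered gradients:", max_diff)
```

The gathered version multiplies 3 rows instead of 8, which is where the saving would come from; the masked version does the full-size work even though most rows contribute nothing.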
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper clearly identifies why existing token filtering methods fail to achieve efficiency gains (inadequate sparsity propagation and sparsity range mismatch with ML libraries), with solid empirical evidence. 2. The paper evaluates across multiple dimensions: different model sizes, various training scenarios, context lengths, and filtering ratios, demonstrating consistent gains and broad applicability.
1. All experiments use models ≤8B parameters. Given that efficiency gains are most critical for larger models (70B+), the absence of such experiments is a significant limitation. The 8B model uses TP=8, but modern large-scale training uses more complex parallelism (TP+PP+DP). How CENTRIFUGE scales to 70B+ models with TP=8, PP=4 setups remains unknown. 2. The automatic graph updating approach relies on "runtime stability" and special prime number markers, which may be fragile. The paper doesn't d
1. The authors identify a crucial bottleneck (lack of real efficiency gains from token filtering) and provide a novel solution with algorithm and system co-design, effectively addressing both the ML aspect (ensuring gradients remain correct and useful) and the systems aspect (making use of sparsity in existing hardware/software). By further filtering attention activations in the backward pass in a safe way, they increase sparsity where it matters, and by transforming sparse ops to dense via grap
1. The experiments, although extensive for the provided settings, are limited to relatively small-scale dense LLMs and a specific domain (mathematical reasoning). It remains an open question how well CENTRIFUGE would scale to much larger LLMs (e.g. 70B+ params) or to training on more diverse, general-domain data. Larger LLM might introduce new bottlenecks or slightly different execution characteristics (communication load, optimizer overheads) that were not encountered at 3–8B scale. It might al
1. The paper provides a clear diagnosis of why token filtering has not delivered superior training efficiency. The proposed method is intuitive and effectively overcomes current limitations. 2. Experiments on fine-tuning foundation models show that CENTRIFUGE preserves the benefits of token filtering and reduces end-to-end training time by 31.7%.
1. Experiments are only conducted on fine-tuning pre-trained foundation models for downstream tasks. However, since the main advantage of the proposed method lies in its training efficiency, it should be validated on computationally intensive pretraining tasks.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational Physics and Python Applications
