ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation
Zihao Huang, Jundong Zhou, Xingwei Qu, Qiyang Min, Ge Zhang

TL;DR
ConceptMoE dynamically merges semantically similar tokens into concept representations, compressing the sequence before it reaches the compute-intensive part of the model. The result is better benchmark performance than a standard MoE at matched compute, along with faster prefill and decoding.
Contribution
It proposes a novel adaptive token-to-concept compression method for MoE models, improving efficiency and performance across language and vision tasks.
Findings
Outperforms standard MoE on language and vision-language benchmarks at matched activated FLOPs and total parameters
Reduces attention computation and KV cache usage significantly
Achieves notable speedups in prefill and decoding times
Abstract
Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concept representations, performing implicit token-level compute allocation. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity, compressing sequences by a target ratio before they enter the compute-intensive concept model. Crucially, the MoE architecture enables controlled evaluation: we reallocate saved computation to match baseline activated FLOPs (excluding attention map computation) and total parameters, isolating genuine architectural benefits. Under these conditions, ConceptMoE consistently outperforms standard MoE across language and vision-language tasks, achieving +0.9 points on language…
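The chunking step described in the abstract can be illustrated with a small sketch. This is not the paper's learnable chunk module: it is a non-learned stand-in that assumes cosine similarity between adjacent token vectors as the boundary signal, places cuts at the least similar adjacent pairs until the target compression ratio is met, and mean-pools each chunk into one concept vector. The function names (`chunk_boundaries`, `merge_to_concepts`) and the pooling choice are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def chunk_boundaries(tokens, target_ratio):
    """Return the sorted start index of each chunk.

    ASSUMPTION: boundaries go where adjacent tokens are least similar.
    The paper's chunk module is learned; this heuristic only mimics the idea.
    target_ratio is tokens per concept (must be > 0), e.g. 2 halves the length.
    """
    n = len(tokens)
    if n == 0:
        return []
    num_chunks = max(1, min(n, round(n / target_ratio)))
    # Similarity between token i-1 and token i, for every adjacent pair.
    sims = [(cosine(tokens[i - 1], tokens[i]), i) for i in range(1, n)]
    # The (num_chunks - 1) least similar adjacent pairs become chunk boundaries.
    cuts = sorted(i for _, i in sorted(sims)[: num_chunks - 1])
    return [0] + cuts


def merge_to_concepts(tokens, target_ratio):
    """Mean-pool each chunk of token vectors into a single concept vector."""
    starts = chunk_boundaries(tokens, target_ratio)
    ends = starts[1:] + [len(tokens)]
    concepts = []
    for start, end in zip(starts, ends):
        chunk = tokens[start:end]
        dim = len(chunk[0])
        concepts.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return concepts


# Four toy token vectors: two similar pairs, with a sharp change in the middle.
toy_tokens = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(chunk_boundaries(toy_tokens, target_ratio=2))
print(merge_to_concepts(toy_tokens, target_ratio=2))
```

With a target ratio of 2, the four toy tokens are cut at the single point where adjacent similarity drops, so the downstream model would see two concept vectors instead of four tokens. In this simplified picture, the shorter sequence is what reduces attention computation and KV cache size.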
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
