Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference

Tong Wu; Yutong He; Bin Wang; Kun Yuan

arXiv:2511.09323·cs.LG·November 13, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Mixture-of-Channels, a new FFN architecture for large language models that reduces activation memory and improves efficiency during training and inference without sacrificing performance.

Contribution

The paper proposes Mixture-of-Channels, a novel FFN design that selectively activates channels based on relevance, addressing activation memory bottlenecks in LLMs.

Findings

01. Significant reduction in activation memory during pre-training.

02. Improved inference throughput and efficiency.

03. Maintained competitive model performance.

Abstract

Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks, driven by scaling laws that correlate model size and training data with performance improvements. However, this scaling paradigm incurs substantial memory overhead, creating significant challenges for both training and inference. While existing research has primarily addressed parameter and optimizer state memory reduction, activation memory, particularly from feed-forward networks (FFNs), has become the critical bottleneck, especially when FlashAttention is used. In this work, we conduct detailed memory profiling of LLMs and identify FFN activations as the predominant source of activation memory overhead. Motivated by this, we introduce Mixture-of-Channels (MoC), a novel FFN architecture that selectively activates only the Top-K most relevant channels per token…
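To make the Top-K channel idea concrete, here is a minimal NumPy sketch of a SwiGLU-style FFN that keeps only the k largest gate activations per token. It is an illustration under assumptions, not the authors' implementation: the function name `topk_channel_ffn`, the weight names, and the choice to rank channels by the SiLU gate value are all hypothetical, since the excerpt above does not specify the selection rule in detail.

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def topk_channel_ffn(x, w_gate, w_up, w_down, k):
    """SwiGLU-style FFN that keeps only the k largest gate channels per token.

    Assumption: channels are ranked by their SiLU gate value.
    """
    gate = silu(x @ w_gate)                              # (tokens, d_ffn)
    # Indices of the k largest gate values in each row (unordered).
    idx = np.argpartition(gate, -k, axis=-1)[:, -k:]     # (tokens, k)
    mask = np.zeros(gate.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    # Zero out every channel that was not selected for this token.
    hidden = np.where(mask, gate * (x @ w_up), 0.0)
    return hidden @ w_down, mask

rng = np.random.default_rng(0)
tokens, d_model, d_ffn, k = 4, 8, 32, 8
x = rng.standard_normal((tokens, d_model))
w_gate = rng.standard_normal((d_model, d_ffn))
w_up = rng.standard_normal((d_model, d_ffn))
w_down = rng.standard_normal((d_ffn, d_model))

y, mask = topk_channel_ffn(x, w_gate, w_up, w_down, k)
```

This sketch computes the dense products first and masks afterwards, so it shows only the semantics of the selection. Any actual memory or latency saving would require storing and computing only the selected channels, for example in a custom kernel.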

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Clarity: The paper clearly describes how MoC can be implemented as a more efficient FFN layer using existing SwiGLU gating values.
2. Efficiency benefit: The approach can reduce activation and gradient memory during training and reduce inference latency on a single token.
3. Compatibility: The proposed model can be integrated with modern optimized LLM kernels and systems.

Weaknesses

1. Incremental novelty: MoC is conceptually very close to CATS (Lee et al., 2024), differing mainly in using Top-K selection instead of a threshold.
2. Fixed-K limitation: The Top-K value is globally fixed across tokens, even though different tokens may require varying numbers of active channels. This may underutilize model capacity or cause redundancy for simple tokens.
3. Lack of accuracy validation: The paper mainly evaluates on memory and throughput metrics. The datasets used are simple, and the models a
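The contrast the reviewer draws between Top-K selection and thresholding can be sketched in a few lines. The helpers `topk_mask` and `threshold_mask`, the use of absolute scores, and the value of `tau` are illustrative assumptions, not taken from either paper. The point is only that a fixed k activates the same number of channels for every token, while a threshold lets that number vary per token.

```python
import numpy as np

def topk_mask(scores, k):
    # Keep exactly k channels per token: the k largest scores in each row.
    idx = np.argpartition(scores, -k, axis=-1)[:, -k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def threshold_mask(scores, tau):
    # Keep every channel whose score reaches tau; the count depends on the token.
    return scores >= tau

rng = np.random.default_rng(0)
scores = np.abs(rng.standard_normal((5, 64)))   # hypothetical per-channel scores

fixed_counts = topk_mask(scores, k=16).sum(axis=-1)
variable_counts = threshold_mask(scores, tau=1.0).sum(axis=-1)
print("Top-K active channels per token:    ", fixed_counts)
print("Threshold active channels per token:", variable_counts)
```

A fixed count makes buffer sizes and kernel shapes predictable, which is one plausible reason to prefer Top-K; the reviewer's concern is that the same predictability prevents the model from spending more channels on harder tokens.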

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The idea of inducing sparse computation during pretraining is reasonable, and the memory footprint is significantly reduced.
2. Experiments are conducted on diverse model structures.
3. The paper is well written and easy to follow.

Weaknesses

1. The end-to-end latency speedup is only fair in the single-batch setting and may be minor under batched inference.
2. Top-K-based training may induce instability because the selection is non-differentiable. I wonder how it compares with softened masking methods.
3. There is no module-wise ablation study or profiling for the designed kernel. I suggest detailed profiling of the kernel.
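The second weakness above contrasts hard Top-K selection with softened masking. As a rough illustration of that distinction (not a method from the paper), the sketch below builds a hard 0/1 mask from the k-th largest score and a sigmoid relaxation of the same cut-off. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def kth_largest(scores, k):
    # k-th largest score in each row, kept as a column for broadcasting.
    return np.partition(scores, -k, axis=-1)[:, -k][:, None]

def hard_topk_mask(scores, k):
    # Step function: 1 for scores at or above the k-th largest, else 0.
    # It is piecewise constant, so it passes no gradient to the scores.
    return (scores >= kth_largest(scores, k)).astype(scores.dtype)

def soft_topk_mask(scores, k, temperature=0.1):
    # Sigmoid relaxation of the same cut-off: smooth in the scores,
    # approaching the hard mask as temperature shrinks.
    return 1.0 / (1.0 + np.exp(-(scores - kth_largest(scores, k)) / temperature))

rng = np.random.default_rng(0)
scores = rng.standard_normal((3, 16))
hard = hard_topk_mask(scores, k=4)
soft = soft_topk_mask(scores, k=4)
```

The soft mask keeps every channel partially active, so it gives up the sparsity that motivates the hard selection; that trade-off is what a comparison of the two would need to quantify.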

Reviewer 03 · Rating 4 · Confidence 3

Strengths

* The paper is clear and well structured.
* The empirical results present compelling evidence in support of the proposed method.
* Developing a kernel optimized for MoC on specific hardware is highly impressive. In many cases, the transition from a strong theoretical concept to a practical implementation ends at the stage of custom optimization. The fact that the authors went further to implement this kernel demonstrates a valuable contribution toward making MoC practically applicable.

Weaknesses

* Theorem 1 is not very clear: if $b\geq a$, $d_{moc}$ could exceed $f_{ffn}$, which would reduce efficiency. Claiming that MoC is at least as good as a dense FFN does not make much sense. Typically, improving efficiency comes at the cost of reduced expressive power. An interesting direction would be to analyze how closely the model can approximate the original function under a given efficiency constraint.

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling