ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns
Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang, Kun Kuang, Zhongyu Wei, Fei Wu, Yu Cheng

TL;DR
ExpertWeaver is a novel method that leverages GLU activation patterns in dense LLMs to efficiently convert them into high-performing sparse MoEs without additional training.
Contribution
It introduces a training-free framework that uncovers inherent MoE structures in dense models using GLU activation patterns, improving conversion quality.
Findings
Outperforms existing dense-to-MoE conversion methods.
Effectively identifies universal and specialized neurons for MoE construction.
Enhances both dynamic pruning and downcycling strategies.
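One way to picture the universal versus specialized split is by how often each neuron fires over a calibration set. The sketch below is only an illustrative heuristic, not the paper's procedure: the function name `classify_neurons`, the thresholds `magnitude_threshold` and `universal_rate`, and the toy activation values are all assumptions introduced here.

```python
def classify_neurons(activations, magnitude_threshold=0.1, universal_rate=0.8):
    """Split neuron indices into (universal, specialized).

    activations: list of per-token lists, one activation value per neuron.
    Both thresholds are hypothetical values chosen for illustration.
    """
    n_tokens = len(activations)
    n_neurons = len(activations[0])
    universal, specialized = [], []
    for i in range(n_neurons):
        # Count the calibration tokens on which neuron i is clearly active.
        active = sum(1 for tok in activations if abs(tok[i]) > magnitude_threshold)
        rate = active / n_tokens
        if rate >= universal_rate:
            universal.append(i)    # fires on almost every token
        else:
            specialized.append(i)  # fires only on some inputs
    return universal, specialized


# Toy calibration activations: 5 tokens x 4 neurons (made-up numbers).
calib = [
    [0.9, 0.0, 0.5, 0.0],
    [0.8, 0.7, 0.0, 0.0],
    [1.1, 0.0, 0.6, 0.0],
    [0.7, 0.0, 0.0, 0.9],
    [0.9, 0.6, 0.0, 0.0],
]
universal, specialized = classify_neurons(calib)
```

Under these assumptions, neuron 0 is active on every token and lands in the universal group, while the remaining neurons fire on a minority of tokens and would be candidates for routed experts.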
Abstract
Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: dynamic structural pruning that converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and downcycling approaches that use pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the…
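A small sketch may help show why a GLU feed-forward layer lends itself to expert partitioning. The layer's output is a sum of per-neuron contributions, so any partition of the intermediate neurons into groups yields "experts" whose outputs add up exactly to the dense output. The code below assumes a SiLU-gated GLU (as used in Qwen2.5 and LLaMA3), tiny random weights, and an arbitrary hypothetical partition in `experts`. It does not reproduce ExpertWeaver's grouping or routing.

```python
import math
import random


def silu(x):
    return x / (1.0 + math.exp(-x))


def glu_neuron_activations(x, w_gate, w_up):
    """Per-neuron GLU activation: silu(w_gate[i] . x) * (w_up[i] . x)."""
    acts = []
    for g_row, u_row in zip(w_gate, w_up):
        g = sum(gi * xi for gi, xi in zip(g_row, x))
        u = sum(ui * xi for ui, xi in zip(u_row, x))
        acts.append(silu(g) * u)
    return acts


def ffn_output(acts, w_down, neuron_ids):
    """Down-project using only the selected intermediate neurons."""
    d_model = len(w_down[0])
    out = [0.0] * d_model
    for i in neuron_ids:
        for j in range(d_model):
            out[j] += acts[i] * w_down[i][j]
    return out


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, d_ff = 4, 8
# One row per intermediate neuron in each matrix (toy sizes).
w_gate = rand_matrix(d_ff, d_model)
w_up = rand_matrix(d_ff, d_model)
w_down = rand_matrix(d_ff, d_model)
x = [random.uniform(-1, 1) for _ in range(d_model)]

acts = glu_neuron_activations(x, w_gate, w_up)
dense = ffn_output(acts, w_down, range(d_ff))

# Hypothetical partition of the 8 neurons into 3 experts.
experts = [[0, 3, 5], [1, 2], [4, 6, 7]]
expert_outs = [ffn_output(acts, w_down, ids) for ids in experts]
recombined = [sum(col) for col in zip(*expert_outs)]

# Differs from the dense output only by floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(dense, recombined))
```

Activating every expert recovers the dense layer. A sparse MoE skips some groups per token, so the quality of a conversion depends on how neurons are grouped and which groups are selected.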
Peer Reviews
Decision · ICLR 2026 Conference Desk Rejected Submission
* The MoE construction algorithm itself is very intuitive. In particular, the method of constructing router weights directly from existing gate weights is simple and rational. However, the act of obtaining neuron activation information from calibration data is a form of machine learning, so calling it "training-free" is likely an overstatement. * Reproducing the method seems easy, and the information necessary for reproduction is fully covered in the paper. Since the proposed method works on mod
* The use of the term "pruning" is arbitrary. Pruning in conventional research refers to methods that reduce the parameters themselves, whereas AbsTopK-GLU and the final goal of a downcycled MoE preserve all parameters and operate adaptively on the input. The objectives of these methods do not align. * Related to the above, the comparisons in Table 1 and Table 2 mix methods that reduce parameters with adaptive methods, making it difficult to find meaning in the comparisons themselves. * It is un
1. Innovative observation that GLU gating encodes neuron-level specialization patterns. 2. Elegant training-free conversion procedure; no additional routing training needed. 3. Consistent performance gains over previous training-free pruning baselines (FLAP, CMoE, LLM-Pruner).
1. Lack of thorough inference-throughput analysis despite claiming efficiency as a key motivation. 2. Missing inference throughput experimental setup details (input length, TP/EP config, number of GPUs, software library information). 3. No comparison with Drop Upcycling or other recent upcycling methods. 4. The baselines are partly outdated and omit dense models released in the same generation as the used base model (Qwen2.5-7B), for example Qwen2.5-3B or Llama-3.2-3B, which would provide fairer
1. The core idea, leveraging existing GLU structures in dense LLMs, is simple yet effective, outperforming existing methods. 2. Demonstrates consistent improvements over LLM-Pruner, FLAP, and CMoE across Qwen2.5-7B and LLaMA3-8B baselines. 3. The paper is clearly written and well-organized, with convincing empirical results supporting its claims.
1. The proposed method can be interpreted as a dynamic, data-dependent filtering mechanism. While the paper includes an ablation on diversity, it would be useful to also analyze how the quantity and diversity of the data affect the sensitivity and stability of performance. 2. Tables 1–2 exclude reasoning and generative benchmarks (e.g., GSM8K, HumanEval). The authors should clarify why these tasks were omitted, as including them would better demonstrate the method’s generality. 3. Details about the training lib
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
