CAMEx: Curvature-aware Merging of Experts
Dung V. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Rachel S.Y. Teo, Tan M. Nguyen, Linh Duy Tran

TL;DR
CAMEx introduces a curvature-aware expert merging method using natural gradients, improving model generalization and efficiency during pre-training and fine-tuning of large language models without high memory costs.
Contribution
It presents a novel expert merging protocol that incorporates natural gradients to account for parameter space curvature, enhancing performance and resource efficiency.
Findings
Outperforms Euclidean merging techniques on NLP tasks
Reduces computational costs while maintaining high performance
Provides theoretical and empirical validation of the method
Abstract
Existing methods for merging experts during model training and fine-tuning predominantly rely on Euclidean geometry, which assumes a flat parameter space. This assumption can limit the model's generalization ability, especially during the pre-training phase, where the parameter manifold might exhibit more complex curvature. Curvature-aware merging methods typically require additional information and computational resources to approximate the Fisher Information Matrix, adding memory overhead. In this paper, we introduce CAMEx (Curvature-Aware Merging of Experts), a novel expert merging protocol that incorporates natural gradients to account for the non-Euclidean curvature of the parameter manifold. By leveraging natural gradients, CAMEx adapts more effectively to the structure of the parameter space, improving alignment between model updates and the manifold's geometry. This approach…
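The contrast the abstract draws between Euclidean and curvature-aware merging can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the use of a per-coordinate (diagonal) curvature estimate, and the choice to divide each expert's offset by that estimate in the style of a natural-gradient step. The summary above does not give the paper's merging equation, so this should not be read as CAMEx's exact update.

```python
# Illustrative sketch only: a diagonal curvature estimate stands in for the
# paper's curvature matrices, and all names here are hypothetical.

def euclidean_merge(base, experts, weights, alpha=1.0):
    """Add the weighted sum of expert offsets (expert - base) to the base."""
    merged = list(base)
    for expert, w in zip(experts, weights):
        for j, (e, b) in enumerate(zip(expert, base)):
            merged[j] += alpha * w * (e - b)
    return merged


def curvature_aware_merge(base, experts, weights, curvatures,
                          alpha=1.0, eps=1e-8):
    """Same as euclidean_merge, but each offset is rescaled per coordinate by
    the inverse of an assumed diagonal curvature estimate, so directions of
    high curvature move less (natural-gradient-style preconditioning)."""
    merged = list(base)
    for expert, w, curv in zip(experts, weights, curvatures):
        for j, (e, b, c) in enumerate(zip(expert, base, curv)):
            merged[j] += alpha * w * (e - b) / (c + eps)
    return merged


if __name__ == "__main__":
    base = [0.0, 0.0]
    experts = [[1.0, 1.0], [3.0, -1.0]]
    weights = [0.5, 0.5]
    curvatures = [[1.0, 4.0], [2.0, 1.0]]  # made-up values for the demo
    print(euclidean_merge(base, experts, weights))
    print(curvature_aware_merge(base, experts, weights, curvatures))
```

With all curvature entries equal to 1, the second function reduces to the first (up to `eps`), which is one way to see Euclidean merging as the flat-space special case.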
Peer Reviews
Decision·ICLR 2025 Poster
1. Originality: CAMEx introduces a novel approach by leveraging natural gradients to accommodate the non-Euclidean geometry of the parameter space, moving beyond traditional Euclidean-based merging methods. This curvature-aware design aligns model updates more naturally with the parameter manifold’s structure, enhancing both optimization and generalization. Additionally, the dynamic merging architecture optimizes resource usage without performance loss, offering a new direction for efficiently sc
While CAMEx introduces a novel and promising curvature-aware approach, several areas could be improved to enhance the paper's rigor and impact.
1. Limited Baseline Comparisons: The paper primarily compares CAMEx with standard Euclidean-based and a few curvature-aware merging methods but lacks comparisons with a broader set of state-of-the-art SMoE techniques, especially recent adaptive or gradient-based approaches. Including a wider range of baselines would better contextualize CAMEx’s strengths
- The curvature-aware approach consistently outperforms traditional Euclidean-based merging methods across different tasks and architectures
- The dynamic merging architecture achieves the same number of experts but reduces FLOPs per token, providing a practical way to improve efficiency
- The Kronecker-based approximation for the curvature matrix is computationally practical and empirically effective based on ablation studies
- Strong theoretical analysis in Section 2.6 shows how gradients w.r.
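Why a Kronecker-structured matrix is cheap to apply can be seen from a standard identity: with row-major flattening, multiplying `kron(A, B)` by the flattened matrix `X` gives the same result as flattening `A X B^T`, so the large matrix never has to be built. The sketch below checks that identity in plain Python. It shows the general property only; how the paper actually factors its curvature matrix is not stated in this summary, and all function names are hypothetical.

```python
# Illustrative check of the identity (A kron B) vec(X) = vec(A X B^T),
# using row-major flattening. Names are hypothetical helpers.

def matmul(P, Q):
    """Plain matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]


def transpose(M):
    return [list(col) for col in zip(*M)]


def kron(A, B):
    """Dense Kronecker product; its size is the product of the factor sizes."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def apply_dense(A, B, X):
    """Materialize kron(A, B) and multiply it by the flattened X."""
    K = kron(A, B)
    x = [v for row in X for v in row]
    return [sum(k * v for k, v in zip(row, x)) for row in K]


def apply_factored(A, B, X):
    """Same product computed from the small factors only."""
    Y = matmul(matmul(A, X), transpose(B))
    return [v for row in Y for v in row]


if __name__ == "__main__":
    A = [[2, 0], [1, 1]]
    B = [[1, 2], [0, 1]]
    X = [[1, 2], [3, 4]]
    print(apply_dense(A, B, X))
    print(apply_factored(A, B, X))
```

For an `m x n` weight matrix, the dense form stores `(m*n)^2` entries while the factored form stores only `m^2 + n^2`, which is the usual motivation for this kind of approximation.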
- Technical aspects of the method need clearer explanation: the causal segmenting approach lacks sufficient background/motivation in the main paper, key equations for merging need step-by-step walkthrough, and curvature-based updates require more detailed explanation
- Performance improvements are somewhat modest, especially on the GLUE benchmark tasks
- Training for longer appears to reduce the performance gap between proposed methods and baselines (Figure 3)
- Impact on larger model scales is
Curvature-aware merging is an interesting idea for expert merging.
1. Section 3.2 states: "Ties-CA achieves the highest scores on SST-2 (94.61), MRPC (92.49), CoLA (60.06), and MNLI (86.45), showing significant improvements over both the vanilla and standard Ties models." However, the experiments in Table 2 do not include any significance test. Could you present a significance test (t-test) of the improvement? Otherwise, you should retract these conclusions.
2. The experimental results are not strong enough to prove the effectiveness of CAMEx. Could you show s
Taxonomy
Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Data Quality and Management
