Parameter-Efficient Fine-Tuning of LLMs with Mixture of Space Experts
Buze Zhang, Jinkai Tao, Zilang Zeng, Neil He, Ali Maatouk, Menglin Yang, Rex Ying

TL;DR
This paper introduces MoSLoRA, a novel parameter-efficient fine-tuning method for LLMs that uses a mixture of geometric spaces to learn richer representations, improving performance on various benchmarks.
Contribution
It proposes a unified framework combining multiple geometric spaces for better representation learning and extends LoRA with heterogeneous geometric experts for dynamic space selection.
Findings
MoSLoRA outperforms baselines with up to 5.6% improvement on MATH500.
Achieves up to 15.9% improvement on MAWPS.
Provides insights into curvature optimization effects.
Abstract
Large Language Models (LLMs) have achieved remarkable progress, with Parameter-Efficient Fine-Tuning (PEFT) emerging as a key technique for downstream task adaptation. However, existing PEFT methods mainly operate in Euclidean space, fundamentally limiting their capacity to capture complex geometric structures inherent in language data. While alternative geometric spaces, like hyperbolic geometries for hierarchical data and spherical manifolds for circular patterns, offer theoretical advantages, forcing representations into a single manifold type ultimately limits expressiveness, even when curvature parameters are learnable. To address this, we propose Mixture of Space (MoS), a unified framework that leverages multiple geometric spaces simultaneously to learn richer, curvature-aware representations. Building on this scheme, we develop MoSLoRA, which extends Low-Rank Adaptation (LoRA)…
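To make the idea of combining several constant-curvature spaces concrete, the following is a minimal sketch and not the paper's implementation. It assumes the κ-stereographic model, a common way to treat hyperbolic (κ < 0), Euclidean (κ = 0), and spherical (κ > 0) spaces with one formula, and uses the exponential map at the origin, exp_0^κ(v) = tan_κ(√|κ|·‖v‖) · v / (√|κ|·‖v‖). The names `tan_k`, `exp_map_origin`, `softmax`, and `mix_spaces` are hypothetical, and the softmax-weighted sum used to combine experts is an illustrative assumption; the abstract does not specify the routing or combination rule.

```python
import math

def tan_k(x, kappa):
    """Curvature-dependent tangent: tan (kappa > 0), tanh (kappa < 0), identity (kappa == 0)."""
    if kappa > 0:
        return math.tan(x)  # note: diverges as x approaches pi/2
    if kappa < 0:
        return math.tanh(x)
    return x

def exp_map_origin(v, kappa):
    """Exponential map at the origin in the kappa-stereographic model (assumed here)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0 or kappa == 0.0:
        return list(v)  # Euclidean case, or zero vector: the map is the identity
    s = math.sqrt(abs(kappa))
    scale = tan_k(s * norm, kappa) / (s * norm)
    return [scale * x for x in v]

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_spaces(v, kappas, gate_logits):
    """Hypothetical mixture: map v into each space, then take a gate-weighted sum."""
    weights = softmax(gate_logits)
    mapped = [exp_map_origin(v, k) for k in kappas]
    return [sum(w * m[i] for w, m in zip(weights, mapped)) for i in range(len(v))]

# Usage: one hyperbolic, one Euclidean, and one spherical expert for a single token.
token_update = [0.3, 0.4]
mixed = mix_spaces(token_update, kappas=[-1.0, 0.0, 1.0], gate_logits=[0.2, 0.0, -0.1])
```

In this sketch a negative curvature shrinks the vector so that it stays inside the unit ball, a positive curvature stretches it, and the gate decides how much each geometry contributes per token. Making each `kappa` a trainable parameter would correspond to the learnable curvature the abstract mentions.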
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper presents a well-motivated observation that existing PEFT methods assume flat Euclidean geometry, which may be suboptimal for hierarchical or cyclic semantic structures. MoSELoRA bridges this gap by modeling curvature diversity through a mixture of geometric experts. 2. The lightweight routing and space-mapping method achieves more than a 4x speedup over the standard exp–log scheme. 3. The curvature evolution analysis demonstrates interpretable geometry adaptation (Fig. 2): lower transformer lay
1. Experiments are limited to the Qwen2-1.5B model. There is no analysis of how this method scales to larger models or different architectures. 2. Efficiency is reported only in terms of runtime; it is unclear whether there is any memory overhead or advantage. 3. The motivation for employing the curvature spaces (hyperbolic, spherical, Euclidean) is well supported by prior literature. However, the paper does not empirically justify the necessity of using all three simultaneously. Table 3 compares s
1) Mixture of Space (MoS) considers more kinds of constant curvature spaces with learnable Gaussian curvature. 2) Ablation experiments make sense. 3) A comprehensive introduction to the references is given.
Quality: 1) The proposed method has a serious flaw in its geometric interpretation, manifested as structural inversion. Specifically, linear layer transformations should be performed in the embedding space, and vector additions should be carried out in the full space to present a clear geometric meaning, not the other way around. Further, no experiment verifies the advantages of learnable curvature families over fixed curvature families, and no theoretical evidence indicates any advantage of in
1. The idea of the mixture of spaces is interesting. 2. The development of the lightweight token routing mechanism and the unified mapping for three spaces is interesting. 3. The proposed simplification achieves an acceleration of the computation for the geometric mapping.
1. The overall scope of the experiments is somewhat limited in terms of the evaluated models, model sizes, datasets, and tasks. Expanding the experimental coverage would strengthen the empirical claims. 2. The paper remains somewhat vague in several important aspects. The clarity and presentation could be improved by providing more details, such as: - how each expert is selected during the forward pass - how the routing value is computed for each token and expert - what the auxiliary loss is - w
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling
