Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts
Ruofeng Yang, Yongcan Li, Bo Jiang, Cheng Chen, Shuai Li

TL;DR
This paper introduces a multi-subspace mixture-of-Gaussians model of the data for diffusion models. The model captures multi-modal data structure, yields improved estimation error bounds and convergence guarantees, and helps explain why diffusion models work well with small datasets.
Contribution
The paper proposes MoLR-MoG modeling for diffusion models, which captures multi-modal and multi-manifold data, and provides theoretical error bounds and a convergence analysis that advance the understanding of diffusion model efficiency.
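To make the data model concrete, here is a minimal NumPy sketch of drawing samples from a union of linear subspaces where each subspace carries a mixture-of-Gaussian latent. The function names (`sample_molr_mog`, `random_basis`), the orthonormal bases, the uniform mixing weights, and the chosen sizes are illustrative assumptions, not the paper's notation or experimental setup.

```python
import numpy as np

def random_basis(D, d, rng):
    """Hypothetical helper: a random (D, d) matrix with orthonormal columns."""
    q, _ = np.linalg.qr(rng.standard_normal((D, d)))
    return q

def sample_molr_mog(n, bases, weights, means, covs, subspace_probs, rng):
    """Draw n points from a union of subspaces with mixture-of-Gaussian latents.

    bases[k]   : (D, d) basis of subspace k (assumed orthonormal here)
    weights[k] : (M,) mixture weights of the latent modes in subspace k
    means[k]   : (M, d) latent mode means
    covs[k]    : (M, d, d) latent mode covariances
    """
    K = len(bases)
    D = bases[0].shape[0]
    x = np.empty((n, D))
    labels = np.empty((n, 2), dtype=int)
    for i in range(n):
        k = rng.choice(K, p=subspace_probs)            # pick a subspace
        m = rng.choice(len(weights[k]), p=weights[k])  # pick a latent mode
        z = rng.multivariate_normal(means[k][m], covs[k][m])
        x[i] = bases[k] @ z                            # embed the latent in R^D
        labels[i] = (k, m)
    return x, labels

# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
D, d, K, M = 64, 2, 3, 4
bases = [random_basis(D, d, rng) for _ in range(K)]
weights = [np.full(M, 1.0 / M) for _ in range(K)]
means = [4.0 * rng.standard_normal((M, d)) for _ in range(K)]
covs = [np.stack([0.1 * np.eye(d)] * M) for _ in range(K)]
x, labels = sample_molr_mog(500, bases, weights, means, covs,
                            np.full(K, 1.0 / K), rng)
```

Every sample lies exactly in one low-dimensional subspace of the ambient space, while the latent coordinates inside that subspace cluster around several modes. A single Gaussian latent per subspace cannot represent those clusters.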
Findings
MoE-latent MoG NN outperforms MoE-latent Gaussian score in generation quality.
MoE-latent MoG NN achieves comparable performance with fewer parameters.
Theoretical error bound escapes the curse of dimensionality.
Abstract
Recently, diffusion models have achieved strong performance with small datasets and a fast optimization process. However, the estimation error of diffusion models suffers from the curse of dimensionality in the data dimension. Since images usually lie on a union of low-dimensional manifolds, existing works model the data as a union of linear subspaces with a Gaussian latent and achieve an improved bound. Although this modeling reflects the multi-manifold property, a Gaussian latent cannot capture the multi-modal property of the latent manifold. To bridge this gap, we propose the mixture subspace of low-rank mixture of Gaussian (MoLR-MoG) modeling, which models the target data as a union of linear subspaces, where each subspace admits a mixture-of-Gaussian latent (multiple modes in a low-dimensional latent space). With this modeling, the corresponding score function…
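As an illustration of why such a model gives a mixture-of-experts style score, the sketch below computes the exact score of the noised marginal for a union of subspaces with mixture-of-Gaussian latents. It relies only on the standard fact that adding Gaussian noise to a Gaussian mixture gives another Gaussian mixture, whose score is a softmax-weighted sum of per-component linear scores. The forward process `x_t = a_t * x_0 + s_t * eps`, the function names, and all numeric values are assumptions for illustration. This is not the paper's network architecture.

```python
import numpy as np

def noised_components(bases, weights, means, covs, subspace_probs, a_t, s_t):
    """Flatten the model into the Gaussian components of the noised marginal."""
    D = bases[0].shape[0]
    log_pi, mus, Cs = [], [], []
    for k, A in enumerate(bases):
        for m in range(len(weights[k])):
            log_pi.append(np.log(subspace_probs[k] * weights[k][m]))
            mus.append(a_t * (A @ means[k][m]))
            # Low-rank signal covariance plus isotropic noise.
            Cs.append(a_t**2 * (A @ covs[k][m] @ A.T) + s_t**2 * np.eye(D))
    return np.array(log_pi), np.array(mus), np.array(Cs)

def log_density_and_score(x, log_pi, mus, Cs):
    """Return log p_t(x), its gradient (the score), and the gating weights."""
    D = x.shape[0]
    log_terms, expert_scores = [], []
    for lp, mu, C in zip(log_pi, mus, Cs):
        diff = x - mu
        sol = np.linalg.solve(C, diff)
        _, logdet = np.linalg.slogdet(C)
        log_terms.append(lp - 0.5 * (diff @ sol + logdet + D * np.log(2 * np.pi)))
        expert_scores.append(-sol)          # linear score of one Gaussian "expert"
    log_terms = np.array(log_terms)
    shift = log_terms.max()                 # log-sum-exp for numerical stability
    log_p = shift + np.log(np.exp(log_terms - shift).sum())
    gate = np.exp(log_terms - log_p)        # softmax responsibilities
    score = gate @ np.array(expert_scores)  # gated combination of experts
    return log_p, score, gate

# Tiny illustrative instance (all sizes and values are assumed).
rng = np.random.default_rng(0)
D, d = 6, 2
bases = [np.linalg.qr(rng.standard_normal((D, d)))[0] for _ in range(2)]
weights = [np.array([0.5, 0.5]), np.array([0.3, 0.7])]
means = [3.0 * rng.standard_normal((2, d)) for _ in range(2)]
covs = [np.stack([0.2 * np.eye(d)] * 2) for _ in range(2)]
log_pi, mus, Cs = noised_components(bases, weights, means, covs,
                                    [0.5, 0.5], a_t=0.8, s_t=0.6)
x = rng.standard_normal(D)
log_p, score, gate = log_density_and_score(x, log_pi, mus, Cs)
```

The `gate` vector plays the role of a router: it is nonlinear in `x`, which is why the overall score is nonlinear even though each component score is linear. With a single Gaussian latent per subspace, the gating would only choose among subspaces.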
Peer Reviews
Decision · ICLR 2026 Poster
1. MoLR-MoG is a novel generative prior that generalizes prior single-subspace Gaussian latent assumptions. The nonlinear MoG score derivation and its MoE interpretation are new theoretical contributions.
2. The paper delivers rigorous generalization bounds and optimization theory that are tight up to log factors and explicit in all problem constants.

1. Experiments are small-scale (MNIST/CIFAR-10). No evaluation on high-resolution images (e.g., ImageNet) or complex datasets (e.g., text-to-image, multi-resolution). FID, LPIPS, or human evaluations are absent; only visual samples are shown. Additionally, the time comparison of the proposed new diffusion model with the original diffusion is lacking.
2. Theoretical guarantees rely on Δ ≫ γ_t (Assumption 6.1) and highly separated Gaussians (Assumption 6.6). No discussion of relaxation regimes or

1. The paper introduces a novel MoLR-MoG modeling framework that is applied to both the training data distribution and the network architecture design.
2. Empirically, the authors validate the effectiveness of this modeling by demonstrating performance comparable to that of the standard U-Net architecture.
3. Theoretically, they show that the MoLR-MoG framework mitigates the curse of dimensionality and establishes local strong convexity in the loss landscape. These theoretical results contribu

1. As a theoretically tractable model, MoLR-MoG inevitably exhibits a gap from real image distributions. As shown in Figure 2, there remains a noticeable discrepancy between CIFAR-10 images generated by MoLR-MoG and those from the true CIFAR-10 dataset.
2. The assumption of highly separated Gaussian components (Assumption 6.6) is unrealistic for real-world image data. In practice, different image classes often share overlapping or correlated semantic subspaces. Consequently, the condition requi

(1) **Novel latent modeling framework**: The paper introduces MoLR-MoG, which combines low-dimensional subspace modeling with a mixture-of-Gaussian latent structure, enabling diffusion models to capture multi-modal and nonlinear latent structures more effectively than prior Gaussian-based approaches.
(2) **Theoretical contributions**: It provides provable estimation error bounds that show how the model can escape the curse of dimensionality, and also establishes convergence guarantees for gradi

Here are some concerns about this paper:
(1) **Inconsistency between experiments and theoretical results**: In Section 3.1, the authors derive a score network architecture based on the MoLR-MoG model. However, in Section 4, they train 10 VAEs to serve as encoders and decoders before applying the network architecture. This experimental setup appears to deviate from the theoretical framework, and it is unclear how it aligns with the assumptions and analysis presented in Section 3.
(2) **Compari
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Generative Adversarial Networks and Image Synthesis
