Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts

Ruofeng Yang; Yongcan Li; Bo Jiang; Cheng Chen; Shuai Li

arXiv:2601.01475·cs.LG·January 6, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a multi-subspace mixture-of-Gaussians model of the data distribution for diffusion models. The model captures multi-modal data structure, yields improved estimation error bounds and convergence guarantees, and helps explain why diffusion models perform well with small datasets.

Contribution

The paper proposes MoLR-MoG modeling for diffusion models, which captures multi-modal, multi-manifold data. It supports the modeling with theoretical estimation error bounds and a convergence analysis, advancing the understanding of diffusion model efficiency.

Findings

01. MoE-latent MoG NN outperforms MoE-latent Gaussian score in generation quality.

02. MoE-latent MoG NN achieves comparable performance with fewer parameters.

03. Theoretical error bound escapes the curse of dimensionality.

Abstract

Recently, diffusion models have achieved strong performance with a small dataset of size $n$ and a fast optimization process. However, the estimation error of diffusion models suffers from the curse of dimensionality, scaling as $n^{-1/D}$ in the data dimension $D$. Since image data usually lie on a union of low-dimensional manifolds, current works model the data as a union of linear subspaces with a Gaussian latent and achieve a $1/n$ bound. Though this modeling reflects the multi-manifold property, a Gaussian latent cannot capture the multi-modal property of the latent manifold. To bridge this gap, we propose the mixture subspace of low-rank mixture of Gaussian (MoLR-MoG) modeling, which models the target data as a union of $K$ linear subspaces, where each subspace admits a mixture-of-Gaussians latent ($n_{k}$ modes with dimension $d_{k}$). With this modeling, the corresponding score function…
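The data model described in the abstract can be made concrete with a small sampler. The following is a minimal sketch, not the authors' implementation: only $K$, $n_{k}$, $d_{k}$ and $D$ come from the abstract, while the orthonormal bases `A`, the isotropic per-mode standard deviations, the uniform weights, and the helper names `make_molr_mog` and `sample_molr_mog` are illustrative assumptions.

```python
import numpy as np

def make_molr_mog(D, dims, n_modes, rng, sep=4.0):
    """Build hypothetical MoLR-MoG parameters (all choices here are assumptions)."""
    subspaces = []
    for d_k, n_k in zip(dims, n_modes):
        # Orthonormal basis A_k of shape (D, d_k) via reduced QR of a random matrix.
        A_k, _ = np.linalg.qr(rng.standard_normal((D, d_k)))
        subspaces.append({
            "A": A_k,
            "means": sep * rng.standard_normal((n_k, d_k)),  # latent mode centres
            "stds": 0.5 * np.ones(n_k),                      # isotropic std per mode
            "weights": np.full(n_k, 1.0 / n_k),              # uniform mode weights
        })
    return subspaces

def sample_molr_mog(subspaces, n, rng, subspace_weights=None):
    """Draw n points: pick a subspace, pick a latent mode, sample z, embed as A_k z."""
    K = len(subspaces)
    if subspace_weights is None:
        subspace_weights = np.full(K, 1.0 / K)
    D = subspaces[0]["A"].shape[0]
    x = np.empty((n, D))
    ks = rng.choice(K, size=n, p=subspace_weights)  # subspace index per sample
    for i, k in enumerate(ks):
        s = subspaces[k]
        j = rng.choice(len(s["weights"]), p=s["weights"])  # latent mode index
        d_k = s["means"].shape[1]
        z = s["means"][j] + s["stds"][j] * rng.standard_normal(d_k)
        x[i] = s["A"] @ z  # low-dimensional latent embedded in R^D
    return x, ks

rng = np.random.default_rng(0)
# K = 2 subspaces in D = 32 dimensions, with d_k = (2, 3) and n_k = (3, 2).
subspaces = make_molr_mog(D=32, dims=[2, 3], n_modes=[3, 2], rng=rng)
x, ks = sample_molr_mog(subspaces, n=200, rng=rng)
```

Every sample lies exactly in one of the $K$ subspaces, so the ambient dimension $D$ only enters through the embedding, while the multi-modal structure lives in the $d_{k}$-dimensional latents.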

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. MoLR-MoG is a novel generative prior that generalizes prior single-subspace Gaussian latent assumptions. The nonlinear MoG score derivation and its MoE interpretation are new theoretical contributions.
2. The paper delivers rigorous generalization bounds and optimization theory that are tight up to log factors and explicit in all problem constants.
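The MoE reading of a mixture-of-Gaussians score can be illustrated with the standard closed form for a noised Gaussian mixture: each component contributes a linear "expert" score, and a softmax over component log-likelihoods acts as the gate. The sketch below is an illustration under stated assumptions, not the paper's network parameterization: the forward process $x_t = \alpha x_0 + \sigma \epsilon$, the isotropic latent standard deviation `s`, and the function names are all hypothetical choices.

```python
import numpy as np

def log_components(x, comps, alpha, sigma):
    """Log of (weight * Gaussian density) of each noised component at point x."""
    D = x.shape[0]
    logs = []
    for w, A, mu, s in comps:
        d = A.shape[1]
        r = x - alpha * (A @ mu)             # residual to the noised mean
        r_par = A.T @ r                      # coordinates inside the subspace
        perp_sq = r @ r - r_par @ r_par      # squared norm outside the subspace
        v_par = alpha**2 * s**2 + sigma**2   # variance along the subspace
        v_perp = sigma**2                    # variance orthogonal to it
        quad = r_par @ r_par / v_par + perp_sq / v_perp
        logdet = d * np.log(v_par) + (D - d) * np.log(v_perp)
        logs.append(np.log(w) - 0.5 * (quad + logdet + D * np.log(2 * np.pi)))
    return np.array(logs)

def expert_scores(x, comps, alpha, sigma):
    """Per-component score -C^{-1}(x - mean); each one is affine in x."""
    out = []
    for w, A, mu, s in comps:
        r = x - alpha * (A @ mu)
        r_par = A @ (A.T @ r)                # projection of r onto the subspace
        v_par = alpha**2 * s**2 + sigma**2
        out.append(-(r_par / v_par + (r - r_par) / sigma**2))
    return np.array(out)

def mixture_score(x, comps, alpha, sigma):
    """Score of the noised mixture: softmax gate times the expert scores."""
    logs = log_components(x, comps, alpha, sigma)
    gate = np.exp(logs - logs.max())
    gate /= gate.sum()                       # posterior responsibilities
    return gate @ expert_scores(x, comps, alpha, sigma), gate

rng = np.random.default_rng(0)
D = 8
comps = []  # entries: (weight, orthonormal basis A, latent mean mu, latent std s)
for d_k, n_k in [(2, 2), (3, 3)]:            # two subspaces, five components total
    A_k, _ = np.linalg.qr(rng.standard_normal((D, d_k)))
    for _ in range(n_k):
        comps.append((1.0 / 5, A_k, 3.0 * rng.standard_normal(d_k), 0.5))
alpha, sigma = 0.8, 0.6
x = rng.standard_normal(D)
score, gate = mixture_score(x, comps, alpha, sigma)
```

The nonlinearity of the overall score comes entirely from the gate: with a single Gaussian component the gate is constant and the score reduces to one affine map.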

Weaknesses

1. Experiments are small-scale (MNIST/CIFAR-10), with no evaluation on high-resolution images (e.g., ImageNet) or more complex settings (e.g., text-to-image, multi-resolution). FID, LPIPS, and human evaluations are absent; only visual samples are shown. A runtime comparison between the proposed model and the original diffusion model is also lacking.
2. Theoretical guarantees rely on Δ ≫ γ_t (Assumption 6.1) and highly separated Gaussians (Assumption 6.6). No discussion of relaxation regimes or

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper introduces a novel MoLR-MoG modeling framework that is applied to both the training data distribution and the network architecture design.
2. Empirically, the authors validate the effectiveness of this modeling by demonstrating performance comparable to that of the standard U-Net architecture.
3. Theoretically, they show that the MoLR-MoG framework mitigates the curse of dimensionality and establishes local strong convexity in the loss landscape. These theoretical results contribu

Weaknesses

1. As a theoretically tractable model, MoLR-MoG inevitably exhibits a gap from real image distributions. As shown in Figure 2, there remains a noticeable discrepancy between CIFAR-10 images generated by MoLR-MoG and those from the true CIFAR-10 dataset.
2. The assumption of highly separated Gaussian components (Assumption 6.6) is unrealistic for real-world image data. In practice, different image classes often share overlapping or correlated semantic subspaces. Consequently, the condition requi

Reviewer 03 · Rating 6 · Confidence 5

Strengths

(1) **Novel latent modeling framework**: The paper introduces MoLR-MoG, which combines low-dimensional subspace modeling with a mixture-of-Gaussian latent structure, enabling diffusion models to capture multi-modal and nonlinear latent structures more effectively than prior Gaussian-based approaches.
(2) **Theoretical contributions**: It provides provable estimation error bounds that show how the model can escape the curse of dimensionality, and also establishes convergence guarantees for gradi

Weaknesses

Here are some concerns about this paper:
(1) **Inconsistency between experiments and theoretical results**: In Section 3.1, the authors derive a score network architecture based on the MoLR-MoG model. However, in Section 4, they train 10 VAEs to serve as encoders and decoders before applying the network architecture. This experimental setup appears to deviate from the theoretical framework, and it is unclear how it aligns with the assumptions and analysis presented in Section 3.
(2) **Comparis

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Generative Adversarial Networks and Image Synthesis