Importance Weighted Score Matching for Diffusion Samplers with Enhanced Mode Coverage
Chenguang Wang, Xiaoyu Zhang, Kaiyuan Cui, Weichen Zhao, Yongtao Guan, Tianshu Yu

TL;DR
This paper introduces Importance Weighted Score Matching, a novel training method for diffusion-based neural samplers that improves mode coverage by directly optimizing an objective akin to the forward KL divergence, especially in data-scarce scenarios.
Contribution
It proposes a principled importance weighted score matching approach that enhances mode coverage in neural samplers without relying on target samples, backed by theoretical analysis and empirical validation.
Findings
Outperforms existing neural samplers on complex multi-modal distributions.
Achieves state-of-the-art results on benchmarks with up to 120 modes.
Provides theoretical insights into bias and variance of the estimator.
Abstract
Training neural samplers directly from unnormalized densities without access to target distribution samples presents a significant challenge. A critical desideratum in these settings is achieving comprehensive mode coverage, ensuring the sampler captures the full diversity of the target distribution. However, prevailing methods often circumvent the lack of target data by optimizing reverse KL-based objectives. Such objectives inherently exhibit mode-seeking behavior, potentially leading to incomplete representation of the underlying distribution. While alternative approaches strive for better mode coverage, they typically rely on implicit mechanisms like heuristics or iterative refinement. In this work, we propose a principled approach for training diffusion-based samplers by directly targeting an objective analogous to the forward KL divergence, which is conceptually known to encourage…
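The contrast between the two divergences mentioned in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's method: it assumes a hypothetical 1D target (an equal-weight mixture of two unit-variance Gaussians centered at -3 and 3) and two hand-picked Gaussian approximations, and evaluates both KL directions with a Riemann sum on a grid. All names (`log_normal`, `kl`, `log_q_mode`, `log_q_cover`) are illustrative assumptions.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log-density of a 1D Gaussian N(mean, std^2)."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def kl(log_a, log_b, dx):
    """KL(a || b) = integral of a * (log a - log b), approximated by a Riemann sum."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * dx)

# Grid wide enough that all densities below have negligible mass outside it.
x = np.linspace(-15.0, 15.0, 20001)
dx = x[1] - x[0]

# Hypothetical bimodal target: 0.5 * N(-3, 1) + 0.5 * N(3, 1).
log_p = np.logaddexp(log_normal(x, -3.0, 1.0), log_normal(x, 3.0, 1.0)) + np.log(0.5)

# Two candidate approximations (assumed for illustration):
log_q_mode = log_normal(x, 3.0, 1.0)             # locks onto one mode
log_q_cover = log_normal(x, 0.0, np.sqrt(10.0))  # matches the target's mean and variance

reverse_mode = kl(log_q_mode, log_p, dx)    # KL(q_mode || p)
reverse_cover = kl(log_q_cover, log_p, dx)  # KL(q_cover || p)
forward_mode = kl(log_p, log_q_mode, dx)    # KL(p || q_mode)
forward_cover = kl(log_p, log_q_cover, dx)  # KL(p || q_cover)

print(f"reverse KL: mode={reverse_mode:.3f}, cover={reverse_cover:.3f}")
print(f"forward KL: mode={forward_mode:.3f}, cover={forward_cover:.3f}")
```

Under these assumptions, the reverse KL is lower for the single-mode fit, while the forward KL is lower for the broad fit that spans both modes. This is the mode-seeking versus mode-covering behavior the abstract refers to.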
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The writing is OK, and the many theorems make the paper look good. The paper is long and covers many details.
**Fundamental Error:** The iDEM objective $$L_{\text{iDEM}}(\theta) = \iint \| s_{\theta}(x_t, t) - \nabla_{x_t} \log p_t(x_t) \|^2 \, p_t^{B}(x_t) \, dx_t \, dt$$ is not the Fisher divergence between $p_t^B$ and $p_t^\theta$ (line 1326 in the paper), since the score inside the L2 norm is that of $p_t$, not $p_t^B$. The Fisher divergence between $p_t^B$ and $p_t^\theta$ is $$FD = \iint \| s_{\theta}(x_t, t) - \nabla_{x_t} \log p^B_t(x_t) \|^2 \, p_t^{B}(x_t) \, dx_t \, dt.$$ Why is this important? Since in iDEM, when th
1. The topic of samplers assisted by generative models is timely.
2. The theoretical analysis extends to bounds on the variance, a known limitation of IS (but it does not include the scaling in dimension).
3. The method is a rather limited modification of the existing approach, iDEM. While the numerical tests show marginal improvements in absolute value on the reported metrics, these are very probably marginal in terms of the accuracy in estimating physical observables (see next point).
4. The numerical evidence for the proposed approach over existing ones is limited. - The approach relies on importance sampling which is expected to suffer from a curse of dimensionality but no syste
* The paper tackles an important problem
* The method is fundamentally unscalable with respect to dimensionality due to its reliance on importance sampling. (1) The empirical approximation of the buffer distribution does not scale (estimating the density of a dataset of samples is itself a challenging topic in probabilistic modeling). (2) It is well known that the IS-based score estimate (Eq. 7) suffers from extremely high variance when $t$ is close to $T$, and the same issue arises for the marginal density estimate (see [1], a crucial
1. The writing is generally easy to follow.
2. The performance of the algorithm, compared to FAB, iDEM, and DIKL, is promising.
3. The paper not only provides empirical results but also derives theoretical bounds for the proposed algorithm.
1. Limited contribution; key limitations unaddressed. The core contribution of this paper is to estimate an importance weight for the score matching objective, which is a natural but limited extension of iDEM. iDEM suffers from (i) high-variance score estimates under the target-score identity, (ii) inefficiency in terms of energy evaluations, and (iii) mode-balance issues (i.e., for a multi-modal target, the obtained samples cannot reflect the true weights of the modes). The proposed method does
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
