Plug-and-Play Controllable Generation for Discrete Masked Models
Wei Guo, Yuchen Zhu, Molei Tao, Yongxin Chen

TL;DR
This paper introduces a versatile plug-and-play importance sampling framework for controllable discrete masked model generation, enabling efficient, task-agnostic sample control without additional training.
Contribution
It proposes a novel importance sampling method that allows control over discrete masked models without task-specific fine-tuning or gradient-based methods.
Findings
Effective in posterior sampling and constrained generation
Applicable to protein design and image generation
Outperforms existing methods in efficiency and flexibility
Abstract
This article makes discrete masked models for the generative modeling of discrete data controllable. The goal is to generate samples of a discrete random variable that adheres to a posterior distribution, satisfies specific constraints, or optimizes a reward function. This methodological development enables broad applications across downstream tasks such as class-specific image generation and protein design. Existing approaches for controllable generation of masked models typically rely on task-specific fine-tuning or additional modifications, which can be inefficient and resource-intensive. To overcome these limitations, we propose a novel plug-and-play framework based on importance sampling that bypasses the need for training a conditional score. Our framework is agnostic to the choice of control criteria, requires no gradient information, and is well-suited for tasks such as…
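To make the importance sampling idea in the abstract concrete, here is a minimal sketch of sampling-importance-resampling (SIR) for a target of the form q(x) ∝ r(x) p(x). It is not the authors' algorithm: the unconditional model p is replaced by a toy uniform sampler over 8-bit sequences, and the names `sir_sample`, `uniform_bits`, and `count_reward` are hypothetical, chosen only for illustration.

```python
import numpy as np


def uniform_bits(rng, length=8):
    """Toy stand-in for an unconditional model p(x): uniform binary sequences."""
    return rng.integers(0, 2, size=length)


def count_reward(x, target_ones=6, strength=2.0):
    """Hypothetical reward: largest when x contains `target_ones` ones."""
    return float(np.exp(-strength * abs(int(x.sum()) - target_ones)))


def sir_sample(proposal_sampler, reward, num_candidates, rng):
    """Draw one sample approximately from q(x) proportional to r(x) p(x).

    Candidates come from p itself, so the importance weight of each
    candidate is just its reward; no gradients or retraining are needed.
    """
    candidates = [proposal_sampler(rng) for _ in range(num_candidates)]
    weights = np.array([reward(x) for x in candidates], dtype=float)
    total = weights.sum()
    if total <= 0.0:
        raise ValueError("all rewards are zero; cannot normalize weights")
    probs = weights / total
    # Resample one candidate in proportion to its normalized weight.
    idx = rng.choice(num_candidates, p=probs)
    return candidates[idx], probs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample, _ = sir_sample(uniform_bits, count_reward, 64, rng)
    print(sample, int(sample.sum()))
```

The approximation improves as `num_candidates` grows, which is the same cost axis the reviewers raise when they ask about the number of Monte Carlo samples.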
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
* The paper tackles a broad category of problem, namely plug-and-play conditional generation using discrete masked models without the need for fine-tuning. Additionally, they lay out the settings in which their methodology would be advantageous (for example, they indicate that this method is useful when evaluating the masked model is much more expensive than evaluating the reward function).
* The authors make a good effort at making the paper reproducible by including source code of the algorithm (
__Theoretical Concerns__:
* Several key aspects of the paper lack a theoretical justification or are not derived in a principled manner. For example, the proposed reward equation $r(x) = \exp\left(-\sum_i w_i \, \text{dist}(m_i(x), A_i)^{\alpha_i}\right)$ is provided with no theoretical grounding or explanation. As best I can tell, the definition of the sampling distribution $q(x) = Z^{-1} r(x) p(x)$ would require $r(x) \geq 0$ in order for $q(x)$ to be a valid distribution. However, this is not me
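As a small illustration of the reward form quoted in the review above, the sketch below evaluates r(x) = exp(-Σ w_i dist(m_i(x), A_i)^{α_i}) under the assumption that each target set A_i is a closed interval and each metric m_i returns a scalar. These choices, and the names `interval_distance`, `reward`, and `fraction_of_ones`, are hypothetical and not taken from the paper. Because the exponential of a finite real number is positive, a reward of this form is non-negative by construction.

```python
import math


def interval_distance(value, lower, upper):
    """Distance from a scalar to the closed interval [lower, upper]."""
    if value < lower:
        return lower - value
    if value > upper:
        return value - upper
    return 0.0


def reward(x, metrics, targets, weights, exponents):
    """Evaluate exp(-sum_i w_i * dist(m_i(x), A_i) ** alpha_i).

    Assumes each A_i is an interval (lower, upper), w_i >= 0, alpha_i > 0.
    """
    penalty = 0.0
    for metric, (lower, upper), w, alpha in zip(metrics, targets, weights, exponents):
        penalty += w * interval_distance(metric(x), lower, upper) ** alpha
    return math.exp(-penalty)


def fraction_of_ones(x):
    """Hypothetical metric m(x): share of positions equal to 1."""
    return sum(x) / len(x)


if __name__ == "__main__":
    args = ([fraction_of_ones], [(0.5, 0.75)], [4.0], [2.0])
    print(reward([1, 1, 1, 0], *args))  # metric inside the target interval
    print(reward([1, 1, 1, 1], *args))  # metric 0.25 above the interval
```

A sample whose metrics all fall inside their target sets receives reward 1, and the reward decays toward 0 as the metrics move away, with `weights` and `exponents` controlling how sharply.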
Overall the paper is very well written. The motivation of the problem, controllable discrete masked model generation without training, is good, as this implies flexible controllable generation without additional computational overhead of training for each controlled generation task. The theory appears to be sound to me without any errors, arriving at the mean field approximation with importance sampling, which seems to be a reasonable approach and yields decent results on both the toy task and p
No limitations are presented in the paper, and it seems like there may be some worth discussing. One is reward function design: it's unclear whether some tasks may have difficult-to-design reward functions, or whether success depends heavily on the choice of reward function. The next is that the number of Monte Carlo samples seems to be quite high; the performance in Figure 10 seems to indicate that even at 10k samples the model is still improving. There really should be more of a discussion about this lim
The paper presents a well-justified method for conditional sampling in the presence of a reward function. It details the assumptions it makes and it gives an intuition for when and why someone would use this method for conditioning. In terms of novelty, SIR is not novel, but its application to masked generative models for controllable generation is. I am not aware of other works that use this idea for masked generative models. The paper is very well written and easy to understand. The motivation is cle
A main weakness of the paper is the experimental results. The work is motivated by the versatility of the approach: they claim strong performance across multiple domains. However, experimental results only include protein generation benchmarks. There are no experiments on text, images or audio with the masked models that are discussed in the introduction. Regarding the protein benchmarks, there is no baseline to compare against and there are no ablation experiments.
* Baselines: It would be go
* The paper is generally well-written and clear
* The method is relatively simple and easy to implement
* The method only requires an unconditional model, and can be used to controllably generate from any conditional distribution given its corresponding reward function
* The novelty is relatively low, as importance sampling has been very well studied in prior works. Although, to my knowledge, I have not seen it applied in the context of controllable generation, the experiments do not demonstrate the effectiveness of the proposed method well
* Core experiments are on relatively easy (low-dim) distributions, and it is unclear how this method scales. How well does the method work for more complex distributions, e.g. for images, longer sequence proteins, e
Taxonomy
Topics: Human Motion and Animation
