Product of Experts for Visual Generation
Yunzhi Zhang, Carson Murtuza-Lanier, Zizhang Li, Yilun Du, Jiajun Wu

TL;DR
This paper introduces a training-free Product of Experts framework that combines diverse visual models at inference time, improving controllability and flexibility in image and video generation tasks.
Contribution
It presents an inference-time knowledge composition method based on Product of Experts (PoE) and Annealed Importance Sampling (AIS), enabling integration of heterogeneous models without additional training.
Findings
Enhanced controllability in image and video synthesis.
Flexible user interfaces for specifying visual goals.
Improved quality over monolithic models.
Abstract
Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources -- including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators -- remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
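To make the idea of sampling from a product distribution with AIS concrete, here is a minimal toy sketch. It is not the paper's implementation: the experts are assumed to be one-dimensional Gaussians, the annealing schedule is linear, and the transition kernel is a single random-walk Metropolis step. The function names (ais_product_of_experts, weighted_mean, gaussian_product_mean) and all parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def log_gauss(x, mu, sigma):
    """Log density of a 1D Gaussian N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def ais_product_of_experts(experts, n_particles=1000, n_steps=50,
                           init_sigma=3.0, step_size=0.5, seed=0):
    """Toy AIS targeting the (unnormalized) product of Gaussian experts.

    experts: list of (mu, sigma) pairs, an assumed stand-in for real models.
    Returns particle positions and their normalized importance weights.
    """
    rng = random.Random(seed)

    def log_init(x):
        # Broad, easy-to-sample starting distribution.
        return log_gauss(x, 0.0, init_sigma)

    def log_product(x):
        # Product of experts = sum of expert log densities.
        return sum(log_gauss(x, mu, sigma) for mu, sigma in experts)

    def log_annealed(x, beta):
        # Geometric path between the initial distribution and the product.
        return (1.0 - beta) * log_init(x) + beta * log_product(x)

    betas = [t / n_steps for t in range(n_steps + 1)]
    xs = [rng.gauss(0.0, init_sigma) for _ in range(n_particles)]
    log_ws = [0.0] * n_particles

    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        for i in range(n_particles):
            x = xs[i]
            # Importance weight update: ratio of consecutive annealed densities.
            log_ws[i] += (beta - beta_prev) * (log_product(x) - log_init(x))
            # One Metropolis step that leaves the current annealed density invariant.
            proposal = x + rng.gauss(0.0, step_size)
            log_accept = log_annealed(proposal, beta) - log_annealed(x, beta)
            if rng.random() < math.exp(min(0.0, log_accept)):
                xs[i] = proposal

    # Normalize weights in a numerically stable way.
    max_log_w = max(log_ws)
    ws = [math.exp(lw - max_log_w) for lw in log_ws]
    total = sum(ws)
    return xs, [w / total for w in ws]


def weighted_mean(xs, weights):
    """Self-normalized importance sampling estimate of the mean."""
    return sum(x * w for x, w in zip(xs, weights))


def gaussian_product_mean(experts):
    """Closed-form mean of a product of 1D Gaussians (precision-weighted)."""
    precisions = [1.0 / sigma ** 2 for _, sigma in experts]
    return sum(mu * p for (mu, _), p in zip(experts, precisions)) / sum(precisions)
```

For Gaussian experts the product is again Gaussian, so the weighted particle mean can be checked against gaussian_product_mean. In the setting described in the abstract, the experts would instead be generative models, visual language models, or simulators, and no closed form would be available.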
Peer Reviews
Decision·ICLR 2026 Poster
1. This paper is clearly presented, with sophisticated structure and logical flow. 2. It addresses an interesting and significant problem: how to effectively utilize different experts for efficient and controllable visual generation. 3. The paper presents detailed experimental results, showing improvement compared to baselines. 4. It also demonstrates strong visual results with clear corresponding explanations.
Method and motivation: Sections 3.1 and 3.2 explain that, to sample from the product of a set of generative experts, the paper uses Markov Chain Monte Carlo to iteratively refine samples based on their likelihood under the product distribution (3.1), and employs AIS and SMC to draw samples from the product-of-experts distribution. However, concerns remain about the intrinsic sharpness of the product-of-experts energy landscape, even with intermediate tempered distributions and part
The proposed method achieves visually satisfactory generative results. The paper is well-structured and easy to follow.
The performance improvements yielded by the proposed approach are fairly small compared to the baseline algorithms, such as Depth2V in the Object-Centric Simulation Input setting.
1. The paper is well-written and easy to follow. 2. The proposed method is general and works well with different expert models across different modalities. 3. The generated results show better quality than the competing baseline, e.g., in Figure 2 the proposed method preserves the original image more faithfully.
1. The proposed method is more heuristic than it appears to be. The main novelty is applying some tweaks to well-established sampling methods when discriminators are involved, and these tweaks work well on modern large expert models. At some point, this feels like a glorified classifier-based guidance with no theoretical guarantee provided. The authors could tone down the claims somewhat. 2. The model performance is not always on the better side. This includes both the metrics across the tables and the visual ex
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
