Product of Experts for Visual Generation

Yunzhi Zhang; Carson Murtuza-Lanier; Zizhang Li; Yilun Du; Jiajun Wu

arXiv:2506.08894·cs.CV·October 10, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a training-free Product of Experts framework that combines diverse visual models at inference time, improving controllability and flexibility in image and video generation tasks.

Contribution

It presents an inference-time knowledge composition method that combines a Product of Experts (PoE) formulation with Annealed Importance Sampling (AIS), enabling integration of heterogeneous models without additional training.
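
For orientation, a product of experts and a generic annealed importance sampling path can be written as below. This is the textbook formulation, not necessarily the paper's exact construction: the symbols $p_i$ (experts), $p_0$ (an easy-to-sample initial distribution), $\beta_t$ (annealing schedule), and $w$ (importance weight) are notation assumed here for illustration.

```latex
\begin{align}
p_{\mathrm{PoE}}(x) &\propto \prod_{i=1}^{N} p_i(x), \\
\pi_t(x) &\propto p_0(x)^{1-\beta_t} \Big( \prod_{i=1}^{N} p_i(x) \Big)^{\beta_t},
  \qquad 0 = \beta_0 < \beta_1 < \dots < \beta_T = 1, \\
\log w &= \sum_{t=1}^{T} (\beta_t - \beta_{t-1})
  \Big( \sum_{i=1}^{N} \log p_i(x_{t-1}) - \log p_0(x_{t-1}) \Big).
\end{align}
```

Here $x_0 \sim p_0$, and each $x_t$ is obtained from $x_{t-1}$ by an MCMC transition that leaves $\pi_t$ invariant. A sample is likely under the product only if every expert assigns it high density, which is what lets each expert act as a constraint.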

Findings

01. Enhanced controllability in image and video synthesis.

02. Flexible user interfaces for specifying visual goals.

03. Improved quality over monolithic models.

Abstract

Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources -- including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators -- remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
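
To make the sampling idea concrete, here is a minimal toy sketch of Annealed Importance Sampling from a product of two experts. It is an illustration under strong assumptions, not the authors' implementation: the "experts" are hypothetical 1D Gaussians rather than visual models, the initial distribution is a broad Gaussian, and the transition kernel is plain random-walk Metropolis-Hastings. All function names and parameters are invented for this example.

```python
import numpy as np


def log_gauss(x, mean, std):
    """Log-density of a 1D Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)


def log_product(x):
    """Unnormalized log-density of the product of two toy experts (assumed)."""
    return log_gauss(x, -1.0, 1.0) + log_gauss(x, 2.0, 0.5)


def log_init(x):
    """Log-density of the broad initial distribution p0 (assumed)."""
    return log_gauss(x, 0.0, 3.0)


def ais_product_of_experts(n_particles=4000, n_temps=100, n_mh=2, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    x = rng.normal(0.0, 3.0, size=n_particles)  # exact samples from p0
    log_w = np.zeros(n_particles)

    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Importance-weight update for moving from temperature b_prev to b.
        log_w += (b - b_prev) * (log_product(x) - log_init(x))

        # Tempered target at temperature b: p0^(1-b) * product^b.
        def log_f(z):
            return (1.0 - b) * log_init(z) + b * log_product(z)

        # Random-walk Metropolis-Hastings steps that leave log_f invariant.
        for _ in range(n_mh):
            prop = x + step * rng.normal(size=n_particles)
            accept = np.log(rng.uniform(size=n_particles)) < log_f(prop) - log_f(x)
            x = np.where(accept, prop, x)

    # Self-normalized weights, computed stably in log space.
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return x, w


samples, weights = ais_product_of_experts()
est_mean = np.sum(weights * samples)
est_var = np.sum(weights * (samples - est_mean) ** 2)
# Analytically, the product of N(-1, 1) and N(2, 0.25) is Gaussian with
# mean 1.4 and variance 0.2, so the estimates should land close to these.
print(est_mean, est_var)
```

The product here is narrower than either expert (precisions add: 1 + 4 = 5), which is a small-scale picture of why product distributions tend to be sharp and why a gradual annealing schedule helps.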

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. This paper is clearly presented, with sophisticated structure and logical flow.
2. It addresses an interesting and significant problem: how to effectively utilize different experts for efficient and controllable visual generation.
3. The paper presents detailed experimental results, showing improvement compared to baselines.
4. It also demonstrates strong visual results with clear corresponding explanations.

Weaknesses

Method and motivation: Sections 3.1 and 3.2 emphasize that, to sample from the product of a set of generative experts, the paper uses Markov Chain Monte Carlo (MCMC) to iteratively refine samples based on their likelihood under the product distribution (3.1), and employs AIS and Sequential Monte Carlo (SMC) to draw samples from the product-of-experts distribution. However, a concern remains about the intrinsic sharpness of the product-of-experts energy landscape, even with intermediate tempered distributions and part

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The proposed method achieves visually satisfactory generative results. The paper is well-structured and easy to follow.

Weaknesses

The performance improvements yielded by the proposed approach are fairly small compared to the baseline algorithms, such as Depth2V in the Object-Centric Simulation Input setting.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper is well-written and easy to follow.
2. The proposed method is general and works well with different expert models across different modalities.
3. The generated results can show better quality than the competing baseline, e.g., in Figure 2 the proposed method preserves the original image more faithfully.

Weaknesses

1. The proposed method is more heuristic than it appears to be. The main novelty is applying some tweaks to well-established sampling methods when discriminators are involved, which work well on the modern large expert models. At some point, this feels like a glorified classifier-based guidance with no theoretical guarantee provided. Maybe the authors can tone down the claims a bit.
2. The model performance is not always on the better side. This includes both the metrics across the tables and the visual ex

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis