RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation
Hanzhuo Huang, Qingyang Bao, Zekai Gu, Zhongshuo Du, Cheng Lin, Yuan Liu, Sibei Yang

TL;DR
RefAny3D introduces a novel 3D asset-referenced diffusion model that integrates multi-view 3D assets into image generation, enabling consistent and versatile 2D image synthesis aligned with 3D references.
Contribution
It presents a cross-domain diffusion framework with dual-branch perception for joint modeling of 3D assets and images, enhancing reference-based image generation capabilities.
Findings
Effectively leverages 3D assets for image generation
Produces spatially aligned RGB images and point maps
Demonstrates improved consistency with 3D references
Abstract
In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the…
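To make the term "point map" concrete, the sketch below shows one common way such a map can be computed: each foreground pixel is unprojected with a pinhole camera model and then moved into the object's canonical frame using an object-to-camera pose. This is an illustrative assumption, not the authors' implementation. The function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the pose convention `p_cam = R p_obj + t`, and the use of `None` for background pixels are all hypothetical choices made for this example.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with a given depth to a camera-space 3D point (pinhole model)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def camera_to_canonical(p_cam: Vec3,
                        rotation: Sequence[Sequence[float]],
                        translation: Sequence[float]) -> Vec3:
    """Invert the assumed pose p_cam = R p_obj + t, i.e. p_obj = R^T (p_cam - t)."""
    d = [p_cam[i] - translation[i] for i in range(3)]
    # Row i of R^T is column i of R.
    return tuple(sum(rotation[k][i] * d[k] for k in range(3)) for i in range(3))


def build_point_map(depth: Sequence[Sequence[float]],
                    fx: float, fy: float, cx: float, cy: float,
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> List[List[Optional[Vec3]]]:
    """Return an H x W grid of canonical-space XYZ coordinates.

    Pixels with non-positive depth are treated as background and stored as None
    (a convention chosen for this sketch only).
    """
    point_map: List[List[Optional[Vec3]]] = []
    for v, row in enumerate(depth):
        out_row: List[Optional[Vec3]] = []
        for u, d in enumerate(row):
            if d <= 0:
                out_row.append(None)
            else:
                p_cam = unproject_pixel(u, v, d, fx, fy, cx, cy)
                out_row.append(camera_to_canonical(p_cam, rotation, translation))
        point_map.append(out_row)
    return point_map
```

Because every pixel stores a coordinate in the object's own frame, the same surface point receives the same value regardless of the viewpoint it is seen from, which is what makes such a map a candidate signal for tying a generated image back to a 3D reference.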
Peer Reviews
Decision: ICLR 2026 Poster
1. The pipeline that pairs real images with mesh assets (via Hunyuan3D + FoundationPose) to obtain aligned point maps is practically useful for research on 3D-aware image diffusion.
2. Domain-LoRA and Reference-LoRA help the network simultaneously generate point maps and RGB, improving stability and disentanglement.
3. The paper is easy to follow, with sound motivation and diagrams.
4. Ablations and comparisons are extensive; qualitative results show crisper textures and better geometric adhe
1. Potential supervision noise from image-to-3D. The dataset relies on image-to-3D generators, which may not perfectly preserve reference fidelity, injecting bias into the training signal. The paper should quantify how frequently generator artifacts or pose errors degrade the learned 3D-conditioned diffusion, and propose mitigation.
2. It remains unclear how much of the gain comes from generating point maps versus simply conditioning on multi-view images; a rigorous comparison against a “no-poi
Product-wise, this task is very useful for commercializing generative models that render varied photos of a specific product, much like a diffusion-model shader.
The paper makes too little technical contribution. Although the pipeline works, it is a straightforward engineering pipeline. I believe this is very practical for industry and product applications, but far below the standard of an ICLR paper.
1. The paper is clearly structured and well-written, with smooth logical flow from motivation to methodology and results.
2. The work introduces a novel 3D asset-referenced diffusion framework that bridges the gap between 2D reference-based generation and 3D-aware synthesis.
3. The experimental section is comprehensive, including qualitative, quantitative, and ablation studies that convincingly demonstrate the superiority of the method.
1. Motivation. The paper employs 3D assets as conditioning inputs to ensure geometry–texture consistency; however, generating 2D images does not inherently require multi-view conditioning. The authors should clarify the motivation for introducing 3D asset-based conditioning in this context.
2. Task Definition. Given that the 3D asset is already available, there exist simpler approaches to achieve similar results, for example, rendering the desired viewpoint as a conditioning image and feeding it
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · 3D Shape Modeling and Analysis
