RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation

Hanzhuo Huang; Qingyang Bao; Zekai Gu; Zhongshuo Du; Cheng Lin; Yuan Liu; Sibei Yang

arXiv:2601.22094·cs.CV·January 30, 2026

TL;DR

RefAny3D introduces a novel 3D asset-referenced diffusion model that integrates multi-view 3D assets into image generation, enabling consistent and versatile 2D image synthesis aligned with 3D references.

Contribution

It presents a cross-domain diffusion framework with dual-branch perception for joint modeling of 3D assets and images, enhancing reference-based image generation capabilities.

Findings

1. Effectively leverages 3D assets for image generation
2. Produces spatially aligned RGB images and point maps
3. Demonstrates improved consistency with 3D references

Abstract

In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the…
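The abstract describes point maps as carrying the asset's canonical-space coordinates alongside multi-view RGB. As a minimal sketch of what such a point map can look like, the Python below back-projects a depth render into per-pixel object-space XYZ and rescales it to an image-like range. The pinhole intrinsics `K`, the `cam_to_obj` transform, the function names, and the normalization bounds are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def point_map_from_depth(depth, K, cam_to_obj):
    """Back-project a depth map into per-pixel canonical (object-space) XYZ.

    depth:      (H, W) depth along the camera z-axis; values <= 0 mark background.
    K:          (3, 3) pinhole intrinsics (assumed camera model).
    cam_to_obj: (4, 4) rigid transform from camera space to canonical space.
    Returns an (H, W, 3) array; background pixels are set to zero.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel column / row indices
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                 # camera-space rays with z = 1
    pts_cam = rays * depth[..., None]               # scale each ray by its depth
    R, t = cam_to_obj[:3, :3], cam_to_obj[:3, 3]
    pts_obj = pts_cam @ R.T + t                     # move into canonical space
    pts_obj[depth <= 0] = 0.0
    return pts_obj


def encode_as_image(point_map, mask, lo=-1.0, hi=1.0):
    """Rescale canonical coordinates from [lo, hi] to [0, 1] so XYZ fits in RGB."""
    img = np.clip((point_map - lo) / (hi - lo), 0.0, 1.0)
    img[~mask] = 0.0                                # keep background black
    return img
```

Because every foreground pixel stores a coordinate in the asset's own frame, the same surface point maps to the same color-coded value in every view, which is what makes this representation a natural companion to the RGB views.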

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The pipeline that pairs real images with mesh assets (via Hunyuan3D + FoundationPose) to obtain aligned point maps is practically useful for research on 3D-aware image diffusion.
2. Domain-LoRA and Reference-LoRA help the network simultaneously generate point maps and RGB, improving stability and disentanglement.
3. The paper is easy to follow, with sound motivation and diagrams.
4. Ablations and comparisons are extensive; qualitative results show crisper textures and better geometric adhe
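The review names Domain-LoRA and Reference-LoRA but does not describe how they are wired into the network. The sketch below only illustrates the generic idea of several named low-rank adapters sharing one frozen weight, with one adapter selected per input stream. The class name, the adapter keys `"rgb"` and `"point_map"`, and the selection mechanism are hypothetical and should not be read as the paper's design.

```python
import numpy as np


class MultiLoRALinear:
    """Frozen linear layer with several named low-rank (LoRA) adapters.

    Hypothetical illustration: one adapter is picked per call, so different
    input streams can share the base weight but receive different updates.
    """

    def __init__(self, in_dim, out_dim, rank, adapters, seed=0):
        rng = np.random.default_rng(seed)
        # Base weight stands in for a pretrained, frozen projection.
        self.W = rng.standard_normal((out_dim, in_dim)) / np.sqrt(in_dim)
        # Usual LoRA init: A is small random, B is zero, so each adapter
        # starts as a no-op and only departs from the base during training.
        self.A = {n: rng.standard_normal((rank, in_dim)) * 0.01 for n in adapters}
        self.B = {n: np.zeros((out_dim, rank)) for n in adapters}
        self.scale = 1.0  # plays the role of alpha / rank

    def __call__(self, x, adapter=None):
        y = x @ self.W.T
        if adapter is not None:
            # Low-rank update: x -> A -> B, added on top of the frozen path.
            y = y + self.scale * (x @ self.A[adapter].T) @ self.B[adapter].T
        return y


layer = MultiLoRALinear(in_dim=8, out_dim=8, rank=2, adapters=["rgb", "point_map"])
tokens = np.ones((4, 8))
rgb_out = layer(tokens, adapter="rgb")
pmap_out = layer(tokens, adapter="point_map")
```

Keeping the adapters separate means training one of them leaves the other stream's output unchanged, which is one plausible reading of the "disentanglement" the reviewer mentions.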

Weaknesses

1. Potential supervision noise from image-to-3D. The dataset relies on image-to-3D generators, which may not perfectly preserve reference fidelity, injecting bias into the training signal. The paper should quantify how frequently generator artifacts or pose errors degrade the learned 3D-conditioned diffusion, and propose mitigation.
2. It remains unclear how much of the gain comes from generating point maps versus simply conditioning on multi-view images; a rigorous comparison against a “no-poi

Reviewer 02 · Rating 2 · Confidence 4

Strengths

From a product standpoint, this task is very useful for commercializing generative models: it renders varied photos of a specific product, much like a diffusion-model shader.

Weaknesses

The paper makes too little technical contribution. The pipeline works, but it is straightforward engineering. I believe it is very practical for industry and product applications, but it falls far below the standard of an ICLR paper.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper is clearly structured and well-written, with smooth logical flow from motivation to methodology and results.
2. The work introduces a novel 3D asset-referenced diffusion framework that bridges the gap between 2D reference-based generation and 3D-aware synthesis.
3. The experimental section is comprehensive, including qualitative, quantitative, and ablation studies that convincingly demonstrate the superiority of the method.

Weaknesses

1. Motivation. The paper employs 3D assets as conditioning inputs to ensure geometry–texture consistency; however, generating 2D images does not inherently require multi-view conditioning. The authors should clarify the motivation for introducing 3D asset-based conditioning in this context.
2. Task Definition. Given that the 3D asset is already available, there exist simpler approaches to achieve similar results—for example, rendering the desired viewpoint as a conditioning image and feeding it

Code & Models

Models

JudgementH/RefAny3D (Hugging Face model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · 3D Shape Modeling and Analysis