RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Xiang Fan; Yuheng Wang; Bohan Fang; Zhongzheng Ren; Ranjay Krishna

arXiv:2605.15196·cs.CV·May 15, 2026

TL;DR

RefDecoder introduces a reference-conditioned video decoder that injects high-fidelity reference images into the decoding process, significantly improving detail preservation and consistency in video generation tasks.

Contribution

The paper proposes RefDecoder, a novel reference-conditioned decoder that enhances existing video generation models by integrating reference image signals via reference attention.

Findings

01 Achieves up to +2.1 dB PSNR over unconditional baselines on multiple benchmarks.

02 Improves subject and background consistency in generated videos.

03 Can be integrated into existing systems without additional fine-tuning.
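To put the reported PSNR gain in perspective, the short sketch below converts a gain in decibels into the implied change in mean squared error. It assumes the standard definition PSNR = 10 * log10(MAX^2 / MSE) with pixel values scaled to [0, 1]; the listing does not say which peak value or averaging scheme the paper uses, and the function names here are illustrative only.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error (assumed definition)."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE must shrink to raise PSNR by gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)


# A +2.1 dB gain, the upper end quoted in the findings above.
ratio = mse_ratio_for_gain(2.1)
print(f"MSE ratio: {ratio:.3f}")             # about 0.617
print(f"MSE reduction: {1.0 - ratio:.1%}")   # roughly 38%
```

Under that assumption, a +2.1 dB improvement corresponds to cutting the reconstruction MSE by a little over a third, independent of the baseline PSNR.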

Abstract

Video generation powers a vast array of downstream applications. However, while latent diffusion models, the de facto standard, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich, high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across…
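The reference attention described in the abstract can be illustrated with a minimal NumPy sketch. This is one plausible reading, not the authors' implementation: it assumes "co-processed" means that video latent tokens attend over the concatenation of video and reference tokens in a single-head attention layer with a residual connection. The function name, the projection matrices `w_q`, `w_k`, `w_v`, and all shapes are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def reference_attention(video_tokens, ref_tokens, w_q, w_k, w_v):
    """Hypothetical single-head attention over [video; reference] tokens.

    video_tokens: (n_video, d) decoder features at one up-sampling stage
    ref_tokens:   (n_ref, d) tokens from an image encoder (assumed same width)
    """
    # Keys and values come from both token sets; queries only from the video.
    context = np.concatenate([video_tokens, ref_tokens], axis=0)
    q = video_tokens @ w_q
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # Residual connection keeps the original decoder features intact.
    return video_tokens + weights @ v, weights


rng = np.random.default_rng(0)
d = 8
video_tokens = rng.standard_normal((6, d))
ref_tokens = rng.standard_normal((4, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, weights = reference_attention(video_tokens, ref_tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (6, 8) (6, 10)
```

In this sketch the output keeps the shape of the video tokens, so such a layer could in principle sit between existing decoder blocks; whether RefDecoder uses joint attention, separate cross-attention, or multiple heads is not stated in the excerpt above.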

Code & Models

Models

🤗
Arrokothwhi/RefDecoder
model

Datasets

aoiandroid/papers
dataset · 28 dl
