RefDecoder: Enhancing Visual Generation with Conditional Video Decoding
Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna

TL;DR
RefDecoder introduces a reference-conditioned video decoder that injects high-fidelity reference images into the decoding process, significantly improving detail preservation and consistency in video generation tasks.
Contribution
The paper proposes RefDecoder, a novel reference-conditioned decoder that enhances existing video generation models by integrating reference image signals via reference attention.
Findings
Achieves up to +2.1 dB PSNR over unconditional baselines on multiple benchmarks.
Improves subject and background consistency in generated videos.
Can be integrated into existing systems without additional fine-tuning.
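For readers less familiar with the metric, the sketch below shows how PSNR is conventionally computed and what a gain of about 2.1 dB implies for mean squared error. It is a generic illustration, not the paper's evaluation code: the function names `psnr` and `mse_ratio_for_gain` are hypothetical, and pixel values are assumed to lie in [0, 1].

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumes values are scaled to [0, max_value]; this is the standard
    definition, not code taken from the paper.
    """
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# A +2.1 dB PSNR gain corresponds to MSE falling to roughly 62% of its
# previous value, i.e. about a 38% reduction in squared error.
ratio = mse_ratio_for_gain(2.1)
```

Because PSNR is logarithmic in MSE, a fixed dB gain always corresponds to the same relative error reduction regardless of the baseline's absolute quality.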
Abstract
Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to a significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich, high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across…
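The abstract does not spell out how reference attention is computed. One plausible reading, sketched below with NumPy, is that the video latent tokens act as queries over the concatenation of video and reference tokens, so each video token can draw detail from the reference frame. Everything here is an assumption made for illustration: the function `reference_attention`, the single-head formulation, the residual connection, the token counts, and the random projection matrices are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_attention(video_tokens, ref_tokens, w_q, w_k, w_v):
    """Single-head attention where video tokens attend to video + reference tokens.

    video_tokens: (n_video, dim) tokens from the denoised video latent.
    ref_tokens:   (n_ref, dim) tokens from a (hypothetical) reference image encoder.
    Returns the updated video tokens and the attention weights.
    """
    # Assumed form of "co-processing": keys and values cover both token sets.
    context = np.concatenate([video_tokens, ref_tokens], axis=0)
    q = video_tokens @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # Residual connection (an assumption) keeps the original video signal.
    return video_tokens + weights @ v, weights


rng = np.random.default_rng(0)
dim = 8
video_tokens = rng.standard_normal((6, dim))  # toy video latent tokens
ref_tokens = rng.standard_normal((4, dim))    # toy reference image tokens
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

out, weights = reference_attention(video_tokens, ref_tokens, w_q, w_k, w_v)
```

In a decoder like the one described, a block of this kind would presumably be applied at each up-sampling stage, with reference tokens resized to match that stage's resolution; the sketch shows only a single application.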
