RefAlign: Representation Alignment for Reference-to-Video Generation

Lei Wang; YuXin Song; Ge Wu; Haocheng Feng; Hang Zhou; Jingdong Wang; Yaxing Wang; Jian Yang

arXiv:2603.25743·cs.CV·March 27, 2026

Open Access · 2 Models

TL;DR

RefAlign introduces an explicit reference feature alignment method for reference-to-video generation, significantly improving identity consistency and semantic accuracy without increasing inference complexity.

Contribution

RefAlign proposes a reference alignment loss that enhances feature discrimination and identity preservation in reference-to-video (R2V) generation, and it reports a higher TotalScore than existing methods on the OpenS2V-Eval benchmark.
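The excerpt here does not give the exact form of the reference alignment loss, so the following is only a generic sketch of how a feature alignment term can be attached to a training objective. It assumes a cosine-based penalty between two sets of paired feature vectors. The names `alignment_loss`, `training_loss`, `ref_features`, `target_features`, and `weight` are hypothetical and do not come from the paper. Because a term like this is evaluated only during training, it is consistent with the claim that inference cost is unchanged.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def alignment_loss(ref_features, target_features):
    """Mean (1 - cosine similarity) over paired feature vectors.

    ref_features: stand-in for features the generator produces for
        reference tokens (assumption).
    target_features: stand-in for features from a frozen encoder
        (assumption).
    """
    if len(ref_features) != len(target_features):
        raise ValueError("feature lists must have the same length")
    sims = [cosine_similarity(u, v)
            for u, v in zip(ref_features, target_features)]
    return 1.0 - sum(sims) / len(sims)


def training_loss(diffusion_loss, ref_features, target_features, weight=0.5):
    """Total training objective: base loss plus a weighted alignment term.

    The alignment term is added only here, at training time; nothing in
    this sketch runs at inference. The default weight is arbitrary.
    """
    return diffusion_loss + weight * alignment_loss(ref_features,
                                                    target_features)
```

In this sketch, features that point in the same direction contribute a loss near 0 and orthogonal features contribute a loss of 1, so minimizing the term pulls the two representations toward each other regardless of their scale.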

Findings

01 RefAlign achieves a superior TotalScore on the OpenS2V-Eval benchmark.

02 The method improves identity consistency and semantic fidelity.

03 No additional inference overhead is introduced.

Abstract

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis