RefAlign: Representation Alignment for Reference-to-Video Generation
Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, Jian Yang

TL;DR
RefAlign introduces an explicit reference feature alignment method for reference-to-video generation, significantly improving identity consistency and semantic accuracy without increasing inference complexity.
Contribution
RefAlign proposes a reference alignment loss that improves feature discrimination and identity preservation in R2V generation; the authors report that it outperforms existing methods.
Findings
RefAlign achieves a higher TotalScore than existing methods on the OpenS2V-Eval benchmark.
The method improves identity consistency and semantic fidelity.
No additional inference overhead is introduced.
Abstract
Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that…
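The abstract as shown here is cut off before it describes the alignment objective, so the following is only a generic illustration of what a representation alignment loss can look like, not RefAlign's actual formulation. It assumes, hypothetically, that each reference-image token inside the DiT has a feature vector and that a frozen encoder supplies a target vector for the same token; the loss is one minus the mean cosine similarity over those pairs. The names `cosine_similarity` and `alignment_loss` are invented for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat a zero vector as carrying no alignment signal.
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(dit_features, target_features):
    """Hypothetical alignment loss: 1 - mean cosine similarity over token pairs.

    dit_features:    per-token vectors taken from the generator (assumption).
    target_features: per-token vectors from a frozen encoder (assumption).
    Returns 0.0 when every pair points in the same direction, 2.0 when
    every pair points in opposite directions.
    """
    if not dit_features or len(dit_features) != len(target_features):
        raise ValueError("feature lists must be non-empty and the same length")
    sims = [cosine_similarity(u, v) for u, v in zip(dit_features, target_features)]
    return 1.0 - sum(sims) / len(sims)
```

A term of this kind would be applied only during training, which is at least consistent with the summary's claim that no inference overhead is added; whether RefAlign uses cosine similarity, and which layers or encoders it aligns, is not stated in the text above.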
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
