BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation

Panwen Hu; Jiehui Huang; Qiang Sun; Xiaodan Liang

arXiv:2505.06985 · cs.CV · May 13, 2025

Open Access

TL;DR

This paper introduces BridgeIV, a novel method for customized text-to-video generation that uses autoregressive feature propagation and test-time optimization to improve structural and texture consistency, outperforming existing methods.

Contribution

The paper presents a new autoregressive structure and texture propagation module combined with test-time reward optimization for enhanced customized text-to-video generation.

Findings

01 Improved CLIP-I consistency by 7.8 points.

02 Enhanced DINO consistency by 13.1 points.

03 Validated effectiveness through extensive experiments.
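
For readers unfamiliar with the two metrics above: CLIP-I and DINO consistency are commonly computed as the average cosine similarity between an image embedding of the reference subject and the embedding of each generated frame. The sketch below illustrates that general recipe only. This summary does not state the paper's exact evaluation protocol, so the function names, the toy vectors standing in for encoder outputs, and the scaling to a 0-100 range are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_consistency(reference_embedding, frame_embeddings):
    """Mean reference-to-frame cosine similarity, scaled by 100.

    Assumption: scores are reported on a 0-100 scale, so one "point"
    corresponds to 0.01 of cosine similarity. In practice the embeddings
    would come from a CLIP or DINO image encoder; here they are plain lists.
    """
    sims = [cosine_similarity(reference_embedding, f) for f in frame_embeddings]
    return 100.0 * sum(sims) / len(sims)


# Toy usage: one frame matches the reference direction, one is orthogonal.
reference = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0]]
score = identity_consistency(reference, frames)
```

Averaging over frames means a single off-identity frame lowers the score, which is why such metrics are used to measure subject consistency across a video.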

Abstract

Both zero-shot and tuning-based customized text-to-image (CT2I) generation have made significant progress for storytelling content creation. In contrast, research on customized text-to-video (CT2V) generation remains relatively limited. Existing zero-shot CT2V methods suffer from poor generalization, while another line of work directly combining tuning-based T2I models with temporal motion modules often leads to the loss of structural and texture information. To bridge this gap, we propose an autoregressive structure and texture propagation module (STPM), which extracts key structural and texture features from the reference subject and injects them autoregressively into each video frame to enhance consistency. Additionally, we introduce a test-time reward optimization (TTRO) method to further refine fine-grained details. Quantitative and qualitative experiments validate the…
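
The two ideas named in the abstract can be illustrated with a toy sketch. It is not the authors' STPM or TTRO implementation, whose details are not given in this summary. It assumes features are plain vectors, that "autoregressive" injection means each frame is blended with both the reference features and the previous frame's already-propagated features, and that the test-time reward is simply negative squared distance to the reference. All function names, weights, and the reward are hypothetical.

```python
def propagate_identity(reference, frames, ref_weight=0.5, prev_weight=0.3):
    """Blend each frame with the reference and the previous propagated frame.

    Hypothetical stand-in for autoregressive feature injection: frame t
    depends on the output for frame t-1, so identity cues are carried forward.
    """
    own_weight = 1.0 - ref_weight - prev_weight
    propagated = []
    prev = reference  # the first frame is conditioned on the reference only
    for frame in frames:
        mixed = [
            own_weight * f + ref_weight * r + prev_weight * p
            for f, r, p in zip(frame, reference, prev)
        ]
        propagated.append(mixed)
        prev = mixed
    return propagated


def reward(features, reference):
    """Toy reward (an assumption): higher when features are closer to the reference."""
    return -sum((x - r) ** 2 for x, r in zip(features, reference))


def refine_at_test_time(features, reference, steps=10, lr=0.1):
    """Gradient ascent on the toy reward at inference time, with no training."""
    x = list(features)
    for _ in range(steps):
        # d(reward)/dx_i = -2 * (x_i - r_i)
        x = [xi + lr * (-2.0 * (xi - ri)) for xi, ri in zip(x, reference)]
    return x


# Toy usage: two frames that start far from the reference subject.
reference = [1.0, 0.0]
frames = [[0.0, 1.0], [0.0, 1.0]]
propagated = propagate_identity(reference, frames)
refined = [refine_at_test_time(f, reference) for f in propagated]
```

In this sketch, propagation pulls every frame toward the reference in a single forward pass, while the refinement loop spends extra compute at inference time to increase the reward further. That division of labor mirrors the abstract's description only at a high level.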

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Artificial Intelligence in Games

Methods: Attention Is All You Need · Layer Normalization · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels