DIVE: Taming DINO for Subject-Driven Video Editing

Yi Huang; Wei Xiong; He Zhang; Chaoqi Chen; Jianzhuang Liu; Mingfu Yan; Shifeng Chen

arXiv:2412.03347 · cs.CV · July 30, 2025

Open Access

TL;DR

DIVE uses semantic features from a pretrained DINOv2 model as implicit correspondences to guide subject-driven video editing. Edits are conditioned on either a target text prompt or a reference image, and stay aligned with the motion of the source video.

Contribution

The paper introduces DIVE (DINO-guided Video Editing), a framework that uses DINO features both to align edits with the motion trajectory of the source video and to preserve the identity of the target subject.

Findings

01. Achieves high-quality, temporally consistent video edits.

02. Effectively registers target subjects using Low-Rank Adaptations.

03. Demonstrates robustness across diverse real-world videos.
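
The second finding refers to Low-Rank Adaptation. As a rough illustration of that general technique only, and not of DIVE's training code, the sketch below adds a trainable low-rank update to a frozen linear layer. The function name `lora_forward`, the matrix sizes, the rank, and the scaling factor `alpha` are all illustrative assumptions rather than values from the paper.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Apply a frozen linear layer plus a low-rank update (hypothetical helper).

    x: (batch, d_in) inputs
    W: (d_out, d_in) frozen pretrained weight
    A: (r, d_in) trainable down-projection
    B: (d_out, r) trainable up-projection
    """
    r = A.shape[0]
    scale = alpha / r
    # Equivalent to x @ (W + scale * B @ A).T, without forming the full update.
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2            # illustrative sizes, not taken from the paper
W = rng.normal(size=(d_out, d_in))   # stands in for a pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01
B = np.zeros((d_out, r))             # zero init: adapted layer starts equal to the frozen one
x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B)
```

Only `A` and `B` would be updated during fine-tuning, so the number of trainable parameters scales with the rank `r` rather than with the full weight matrix.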

Abstract

Building on the success of diffusion models in image generation and editing, video editing has recently gained substantial attention. However, maintaining temporal consistency and motion alignment remains challenging. To address these issues, this paper proposes DINO-guided Video Editing (DIVE), a framework designed to facilitate subject-driven editing in source videos conditioned on either target text prompts or reference images with specific identities. The core of DIVE lies in leveraging the powerful semantic features extracted from a pretrained DINOv2 model as implicit correspondences to guide the editing process. Specifically, to ensure temporal motion consistency, DIVE employs DINO features to align with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model to…
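
To make the phrase "implicit correspondences" more concrete, here is a minimal sketch of how per-patch features from two frames can be matched by cosine similarity. It is not the authors' pipeline: random arrays stand in for DINOv2 patch features, the function name `match_patches` is hypothetical, and the abstract does not say how DIVE actually consumes the features during editing.

```python
import numpy as np

def match_patches(feats_src, feats_tgt):
    """For each source patch, find the most similar target patch (hypothetical helper).

    feats_src: (N, D) patch features of one frame
    feats_tgt: (M, D) patch features of another frame
    Returns the matched target indices and their cosine similarities.
    """
    src = feats_src / np.linalg.norm(feats_src, axis=1, keepdims=True)
    tgt = feats_tgt / np.linalg.norm(feats_tgt, axis=1, keepdims=True)
    sim = src @ tgt.T                     # (N, M) cosine similarity matrix
    return sim.argmax(axis=1), sim.max(axis=1)

rng = np.random.default_rng(0)
n_patches, dim = 64, 32                   # placeholder sizes, not DINOv2's real ones
frame_a = rng.normal(size=(n_patches, dim))

# Simulate motion: patch i of frame A reappears at position perm[i] in frame B,
# with a small perturbation of its feature vector.
perm = rng.permutation(n_patches)
frame_b = np.empty_like(frame_a)
frame_b[perm] = frame_a + 0.01 * rng.normal(size=frame_a.shape)

matches, scores = match_patches(frame_a, frame_b)
```

In this toy setup the recovered `matches` can be compared against `perm` to check that each patch is tracked to its new position. With real features, the same nearest-neighbour lookup would give a rough patch-level trajectory across frames.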

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Rights Management and Security

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer · self-DIstillation with NO labels · Diffusion