LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization

Zhihan Xiao; Lin Liu; Yixin Gao; Xiaopeng Zhang; Haoxuan Che; Songping Mai; Qi Tian

arXiv:2512.02933 · cs.CV · December 4, 2025

TL;DR

LoVoRA introduces a mask-free, learnable object-aware localization framework for consistent text-guided video object removal and addition, eliminating the need for auxiliary masks or reference images.

Contribution

LoVoRA is an end-to-end video editing method that pairs a Diffusion Mask Predictor with a dataset construction pipeline (image-to-video translation, optical flow-based mask propagation, and video inpainting) to produce scalable, high-quality, temporally consistent edits.

Findings

01 Achieves high-quality, temporally consistent video edits
02 Eliminates reliance on external masks or reference images
03 Demonstrates superior performance through extensive experiments

Abstract

Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using an object-aware localization mechanism. Our approach uses a dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves…
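
The optical flow-based mask propagation step named in the abstract can be pictured with a small sketch. This is not the authors' implementation: the function names `propagate_mask` and `propagate_through_video`, the list-based mask and flow layout, and the nearest-pixel forward warp are all assumptions chosen for illustration. It only shows the general idea of carrying a first-frame object mask forward through a video using per-pixel motion vectors.

```python
from typing import List, Tuple

Mask = List[List[int]]                   # H x W grid of 0/1 values
Flow = List[List[Tuple[float, float]]]   # H x W grid of (dx, dy) motion vectors


def propagate_mask(mask: Mask, flow: Flow) -> Mask:
    """Forward-warp a binary mask by one frame (hypothetical helper).

    Each masked pixel (x, y) is moved to the nearest pixel at
    (x + dx, y + dy); pixels that leave the frame are dropped.
    """
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            dx, dy = flow[y][x]
            nx, ny = int(round(x + dx)), int(round(y + dy))
            if 0 <= nx < width and 0 <= ny < height:
                out[ny][nx] = 1
    return out


def propagate_through_video(first_mask: Mask, flows: List[Flow]) -> List[Mask]:
    """Chain single-frame propagation across a sequence of flow fields."""
    masks = [first_mask]
    for flow in flows:
        masks.append(propagate_mask(masks[-1], flow))
    return masks


if __name__ == "__main__":
    # Toy 3x4 frame with one masked pixel; the object moves right, then down.
    mask = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
    flow_right = [[(1.0, 0.0)] * 4 for _ in range(3)]
    flow_down = [[(0.0, 1.0)] * 4 for _ in range(3)]
    for frame_mask in propagate_through_video(mask, [flow_right, flow_down]):
        print(frame_mask)
```

In practice the flow fields would come from an optical flow estimator, and a forward warp like this one can leave holes or accumulate drift over long clips, so a real pipeline would likely add refinement on top.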

Code & Models

Datasets

cz-5f/LoVoRA · dataset · 990 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Enhancement Techniques