YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal

Chenyang Wu; Lina Lei; Fan Li; Chun-Le Guo; Dehong Kong; Xinran Qin; Zhixin Wang; Ming-Ming Cheng; Chongyi Li

arXiv:2604.27322 · cs.CV · May 1, 2026

TL;DR

YOSE is an efficient fine-tuning framework for DiT-based video object removal that selectively processes essential tokens, significantly reducing inference latency while maintaining high visual quality.

Contribution

YOSE introduces two modules: Batch Variable-length Indexing (BVI), which adaptively selects essential tokens from the mask, and the Diffusion Process Simulator (DiffSim). Together they enable mask-aware acceleration of video object removal.

Findings

01. YOSE achieves up to a 2.5× speedup in 70% of cases.

02. Inference time scales linearly with the size of the masked region.

03. YOSE maintains visual quality comparable to baseline methods.

Abstract

Recent advances in Diffusion Transformer (DiT)-based video generation have shown impressive results for video object removal. However, these methods still suffer from substantial inference latency. For instance, although MiniMax Remover achieves state-of-the-art visual quality, it runs at only around 10 FPS, primarily because it performs dense computation over the entire spatiotemporal token space even when only a small masked region actually requires processing. In this paper, we present YOSE (You Only Select Essential Tokens), an efficient fine-tuning framework. YOSE introduces two key components: Batch Variable-length Indexing (BVI) and the Diffusion Process Simulator (DiffSim) module. BVI is a differentiable dynamic indexing operator that adaptively selects essential tokens based on mask information, enabling variable-length token processing across samples. DiffSim provides a…
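
To make the idea of mask-based, variable-length token selection concrete, here is a minimal NumPy sketch. It is not the authors' BVI operator: it is not differentiable, and the function names (`select_essential_tokens`, `scatter_back`), the `(B, N, C)` token layout, and the cumulative-length bookkeeping are all assumptions made for illustration. It only shows how tokens inside each sample's mask could be packed into one flat buffer, processed, and written back.

```python
import numpy as np


def select_essential_tokens(tokens, mask):
    """Pack the masked tokens of every sample into one flat buffer.

    tokens: (B, N, C) array of token features (hypothetical layout).
    mask:   (B, N) boolean array, True where a token lies in the masked region.
    Returns the packed tokens, cumulative per-sample lengths, and the indices
    needed to scatter results back.
    """
    # np.nonzero walks the mask in row-major order, so selected tokens
    # come out grouped by sample.
    batch_idx, token_idx = np.nonzero(mask)
    packed = tokens[batch_idx, token_idx]      # (total_selected, C)
    lengths = mask.sum(axis=1)                 # tokens kept per sample
    # cu_seqlens[i]:cu_seqlens[i + 1] is the slice of `packed` for sample i.
    cu_seqlens = np.concatenate([[0], np.cumsum(lengths)])
    return packed, cu_seqlens, batch_idx, token_idx


def scatter_back(tokens, updated, batch_idx, token_idx):
    """Write processed tokens back; unselected tokens stay unchanged."""
    out = tokens.copy()
    out[batch_idx, token_idx] = updated
    return out


if __name__ == "__main__":
    tokens = np.arange(24, dtype=float).reshape(2, 4, 3)
    mask = np.array([[False, True, True, False],
                     [False, False, False, True]])
    packed, cu_seqlens, b, t = select_essential_tokens(tokens, mask)
    print(packed.shape, cu_seqlens)  # 3 of the 8 tokens are selected
    restored = scatter_back(tokens, packed * 0.0, b, t)
    print(restored[0])
```

In this sketch the amount of work handed to later layers is proportional to the number of selected tokens rather than to `B * N`, which is the intuition behind processing only the masked region.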

Code & Models

Repository: Wucy0519/YOSE-CVPR26 (GitHub)
