SEDiT: Mask-Free Video Subtitle Erasure via One-step Diffusion Transformer
Zheng Hui, Yunlong Bai

TL;DR
SEDiT is a mask-free, one-step diffusion transformer for video subtitle erasure. It removes subtitles directly and efficiently, without a precomputed segmentation mask or a two-stage pipeline.
Contribution
The paper proposes a one-stage, mask-free method for video subtitle erasure built on a diffusion transformer. By erasing subtitles directly, it avoids the sub-optimality of two-stage pipelines that first segment the subtitle and then inpaint the masked region.
Findings
Effective removal of subtitles in high-resolution videos.
Maintains temporal consistency with hybrid training strategies.
Supports long, native 1440p videos with streaming inference.
Abstract
Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods often rely on inpainting video frames based on masked input, which requires extracting the target video mask in advance, and the precision of the segmentation directly affects the quality of the completion. In this paper, we present SEDiT, a novel one-stage video Subtitle Erasure approach via One-step Diffusion Transformer. We introduce a mask-free inference approach that enables direct erasure of the targeted subtitle. The proposed one-stage framework mitigates the sub-optimality inherent in the two-stage processing of prior models. Since subtitle removal is a localized editing task in which most pixels remain unchanged, the underlying distribution shift is minimal, making it well-suited to one-step generation under rectified flow. We…
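The abstract's argument for one-step generation can be illustrated with a toy example. Under rectified flow, a sample moves along a path x_t = (1 - t) * x_0 + t * x_1, and the training target for the velocity is the constant x_1 - x_0. When the path is straight, a single Euler step from t = 0 to t = 1 already lands on the endpoint. The sketch below is not SEDiT's implementation: it assumes, purely for illustration, that the flow runs from a subtitled frame to a clean frame, and it replaces the learned network with a hypothetical oracle (`oracle_velocity`). The frames, pixel values, and function names are all invented for the example.

```python
def euler_integrate(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy 1-D "frame": 0.9 marks subtitle pixels, the rest is background.
subtitled = [0.2, 0.2, 0.9, 0.9, 0.3, 0.3]
clean = [0.2, 0.2, 0.25, 0.25, 0.3, 0.3]


def oracle_velocity(x, t):
    # Hypothetical stand-in for a learned model (an assumption, not the paper's
    # network): for a straight path the rectified-flow target is x_1 - x_0,
    # which is constant in t and zero wherever the pixel does not change.
    return [c - s for c, s in zip(clean, subtitled)]


one_step = euler_integrate(subtitled, oracle_velocity, num_steps=1)
many_steps = euler_integrate(subtitled, oracle_velocity, num_steps=50)

print("velocity:  ", oracle_velocity(subtitled, 0.0))
print("one step:  ", one_step)
print("fifty steps:", many_steps)
```

In this toy setting the velocity is nonzero only at the two subtitle pixels, and the one-step and fifty-step results agree up to floating-point rounding. A learned velocity field is only approximately straight, so how well this carries over in practice depends on the model and training.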
