CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal
Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu

TL;DR
CLEAR is a mask-free framework for adaptive video subtitle removal that combines context-aware learning with generation feedback. It is reported to outperform mask-dependent baselines in restoration quality and in zero-shot generalization across languages.
Contribution
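The paper introduces a mask-free, end-to-end subtitle removal method with a two-stage design: Stage I extracts subtitle priors, and Stage II refines the output using LoRA-based adaptation with generation feedback. Its trainable parameters amount to only 0.77% of the base diffusion model, and it improves zero-shot multilingual performance.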
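The abstract attributes the small training footprint to LoRA-based adaptation. As a rough illustration of why low-rank adapters keep the trainable share so small, the sketch below counts the parameters a rank-r update adds to a set of linear layers. The layer shapes and the rank are hypothetical and are not taken from the paper; only the 0.77% figure comes from the source, and the sketch does not attempt to reproduce it.

```python
from typing import List, Tuple


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA update W + B @ A,
    with A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(layers: List[Tuple[int, int]], rank: int) -> float:
    """Ratio of LoRA parameters to frozen base parameters.

    `layers` is a list of (d_in, d_out) shapes for the adapted linear layers.
    Biases and any non-adapted layers are ignored in this simplified count.
    """
    base = sum(d_in * d_out for d_in, d_out in layers)
    added = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in layers)
    return added / base


# Hypothetical configuration: four 1024x1024 projections adapted at rank 4.
hypothetical_layers = [(1024, 1024)] * 4
fraction = trainable_fraction(hypothetical_layers, rank=4)
print(f"{fraction:.2%}")
```

For square layers of width d the ratio reduces to 2r/d, so the trainable share shrinks as the base model gets wider or the rank gets smaller.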
Findings
Outperforms mask-dependent baselines by 6.77 dB in PSNR and reduces VFID by 74.7% on Chinese subtitle benchmarks.
Trains only 0.77% as many parameters as the base diffusion model contains.
Demonstrates superior zero-shot generalization across six languages.
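To give a sense of scale for the PSNR figure, the sketch below converts a PSNR gain in dB into the corresponding reduction in mean squared error. It uses the standard PSNR definition for a single image pair at a fixed peak value; benchmark PSNR is usually averaged over frames and videos, so the conversion is only indicative. The function names are illustrative.

```python
import math


def psnr(mse: float, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_psnr_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises by `gain_db` decibels."""
    return 10.0 ** (gain_db / 10.0)


ratio = mse_ratio_for_psnr_gain(6.77)
print(f"MSE shrinks by a factor of about {ratio:.2f}")
```

Under these assumptions, a 6.77 dB gain corresponds to roughly a 4.75-fold reduction in mean squared error.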
Abstract
Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle…
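The abstract does not spell out the form of the Stage I orthogonality constraint. One common way to encourage two encoders to produce disentangled features is to penalize the squared cosine similarity between their outputs. The sketch below shows that generic form on plain Python lists; it is an assumption for illustration, not the loss used in the paper, and the names `subtitle_feats` and `background_feats` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float], eps: float = 1e-12) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def orthogonality_penalty(
    subtitle_feats: List[Sequence[float]],
    background_feats: List[Sequence[float]],
) -> float:
    """Mean squared cosine similarity over paired feature vectors.

    The value is 0 when every pair is orthogonal and approaches 1 when
    every pair is parallel, so minimizing it pushes the two encoders
    toward non-overlapping directions in feature space.
    """
    sims = [cosine(u, v) ** 2 for u, v in zip(subtitle_feats, background_feats)]
    return sum(sims) / len(sims)


# Toy features: the first pair is orthogonal, the second is parallel.
penalty = orthogonality_penalty(
    [[1.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [2.0, 2.0]],
)
print(round(penalty, 3))
```

Squaring the similarity penalizes both aligned and anti-aligned features, which is why it is a frequent choice over the raw cosine for this kind of constraint.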
