CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

Qingdong He; Chaoyi Wang; Peng Tang; Yifan Yang; Xiaobin Hu

arXiv:2603.21901 · cs.CV · May 12, 2026


TL;DR

CLEAR is a mask-free framework for adaptive video subtitle removal that combines context-aware learning with generation feedback. The authors report that it outperforms mask-dependent baselines in quality and generalizes zero-shot across languages.

Contribution

The paper introduces a mask-free, end-to-end subtitle removal method built on a two-stage design with generation feedback. It requires only 0.77% of the base diffusion model's parameters for training and improves zero-shot multilingual performance.
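The abstract attributes the small training footprint to LoRA-based adaptation. As a rough illustration of why low-rank adapters train so few parameters, the sketch below counts adapter parameters relative to the frozen weights they attach to. The layer shapes and rank are hypothetical and are not taken from the paper; the function name `lora_trainable_fraction` is likewise made up for this example.

```python
def lora_trainable_fraction(layer_shapes, rank):
    """Ratio of LoRA adapter parameters to the frozen weights they adapt.

    Each frozen weight W of shape (d_out, d_in) receives a low-rank update
    B @ A, with B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    frozen = sum(d_out * d_in for d_out, d_in in layer_shapes)
    trainable = sum(rank * (d_out + d_in) for d_out, d_in in layer_shapes)
    return trainable / frozen


# Hypothetical configuration (NOT the paper's): 48 square projection
# matrices of width 1024, adapted with rank-8 LoRA.
shapes = [(1024, 1024)] * 48
fraction = lora_trainable_fraction(shapes, rank=8)
print(f"adapter parameters relative to adapted weights: {fraction:.4%}")
```

For square layers the ratio reduces to 2 * rank / width, so it shrinks as the layers get wider. In a full model the percentage would be lower still, since layers without adapters add to the base parameter count but not to the trainable one.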

Findings

01. Outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID on Chinese benchmarks.

02. Requires only 0.77% of the base diffusion model's parameters for training.

03. Demonstrates superior zero-shot generalization across six languages.
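To put the reported gains in perspective, the sketch below uses the standard definition PSNR = 10 * log10(MAX^2 / MSE) to convert a dB gain into an equivalent reduction in mean squared error. It assumes a single image pair with a fixed peak value; benchmark PSNR is typically averaged over frames, so the conversion is only an approximate reading of the +6.77 dB figure. The helper names and the example MSE value are illustrative.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


# A +6.77 dB PSNR gain corresponds to this many times lower MSE.
ratio = mse_ratio_for_gain(6.77)

# Check with an arbitrary (hypothetical) baseline MSE of 100.
baseline_mse = 100.0
gain = psnr(baseline_mse / ratio) - psnr(baseline_mse)

# A -74.7% change in VFID leaves this fraction of the baseline value.
vfid_remaining = 1.0 - 0.747

print(f"MSE ratio: {ratio:.2f}, recovered gain: {gain:.2f} dB")
print(f"VFID remaining: {vfid_remaining:.3f} of baseline")
```

Under these assumptions, the PSNR gain amounts to a bit under a fivefold reduction in MSE, and the VFID change leaves roughly a quarter of the baseline value.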

Abstract

Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle…
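The abstract does not spell out the form of the Stage I orthogonality constraint. One common way to encourage two encoders to produce disentangled features is to penalize the squared cosine similarity between their outputs; the sketch below shows that generic idea only. The function names, the feature vectors, and the choice of penalty are assumptions for illustration, not the paper's loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero norms


def orthogonality_loss(subtitle_feats, background_feats):
    """Mean squared cosine similarity over paired feature vectors.

    The loss is 0 when every pair is orthogonal and approaches 1 when
    every pair is parallel, so minimizing it pushes the two encoders
    toward non-overlapping representations.
    """
    sims = [cosine(s, b) for s, b in zip(subtitle_feats, background_feats)]
    return sum(c * c for c in sims) / len(sims)


# Toy features standing in for the outputs of the two encoders.
orthogonal = orthogonality_loss([[1.0, 0.0]], [[0.0, 1.0]])
aligned = orthogonality_loss([[1.0, 0.0]], [[2.0, 0.0]])
print(f"orthogonal pair: {orthogonal:.3f}, aligned pair: {aligned:.3f}")
```

In a training setup this term would be computed on batched tensors with an autodiff framework and added to the other objectives; the pure-Python version here only illustrates the quantity being minimized.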
