Towards Visually-Guided Movie Subtitle Translation for Indic Languages
Tarun Chintada, Kshetrimayum Boynao Singh, Asif Ekbal

TL;DR
This paper explores multimodal movie subtitle translation for Indic languages, demonstrating that selective visual grounding improves translation quality by capturing scene context and emotion, especially in long videos.
Contribution
It introduces a lightweight visual grounding strategy using attribute summaries and shows that selective grounding enhances translation without extensive visual processing.
Findings
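Oracle selective grounding, which replaces only the lowest-quality 20-30% of baseline segments with visual-enhanced outputs, consistently improves COMET over the text-only baseline.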
Attribute-based summaries effectively capture scene context.
Temporal misalignment between subtitles and frames is a major obstacle in long-form video and often renders indiscriminate visual grounding ineffective.
Abstract
Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, Tamil and Kannada). We present a case study on five full-length films and compare two lightweight visual grounding strategies: structured attribute summaries from a 5-minute sliding window and free-text summaries of inter-subtitle visual gaps. Our analysis shows that temporal misalignment between subtitles and frames is a major obstacle in long-form video, often rendering indiscriminate visual grounding ineffective. However, oracle selective grounding, which replaces only the lowest-quality 20-30% of baseline segments with visual-enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Among the two…
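The selective replacement step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each segment already has a text-only translation, a visually grounded translation, and a per-segment quality score for the text-only output (for example a segment-level COMET score, which is what would make the selection an oracle). The function name `oracle_selective_grounding` and its parameters are hypothetical.

```python
import math


def oracle_selective_grounding(baseline, visual, scores, fraction=0.25):
    """Swap in visually grounded outputs for the lowest-scoring baseline segments.

    baseline: text-only translations, one per subtitle segment
    visual:   visually grounded translations for the same segments
    scores:   per-segment quality of the baseline (assumed given, e.g. COMET)
    fraction: share of segments to replace (the abstract reports 20-30%)
    """
    if not (len(baseline) == len(visual) == len(scores)):
        raise ValueError("baseline, visual, and scores must have equal length")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")

    # Number of segments to replace, rounded down.
    k = math.floor(len(baseline) * fraction)

    # Indices of the k lowest-scoring baseline segments.
    worst = set(sorted(range(len(scores)), key=lambda i: scores[i])[:k])

    # Keep the baseline everywhere except the selected segments.
    merged = [visual[i] if i in worst else baseline[i] for i in range(len(baseline))]
    return merged, sorted(worst)


baseline = ["b0", "b1", "b2", "b3"]
visual = ["v0", "v1", "v2", "v3"]
scores = [0.9, 0.2, 0.7, 0.4]
merged, replaced = oracle_selective_grounding(baseline, visual, scores, fraction=0.5)
print(merged, replaced)
```

Because only the selected segments need a visually grounded translation, a pipeline of this shape would run visual processing on a fraction of the film rather than all of it, which matches the abstract's point about reduced visual processing.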
