Video-guided Machine Translation with Global Video Context
Jian Chen, JinZe Lv, Zi Long, XiangHua Fu

TL;DR
This paper introduces a video-guided multimodal translation framework that leverages global video context and a region-aware cross-modal attention mechanism to improve subtitle translation quality for long videos.
Contribution
It proposes a globally video-guided approach using a pretrained semantic encoder and vector database retrieval, enhancing long-video translation beyond local segment alignment.
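The retrieval step described above can be illustrated with a minimal sketch. It assumes each subtitle has already been embedded by some semantic encoder, and it uses a brute-force cosine-similarity search as a stand-in for the vector database; the function names, the toy embeddings, the value of `k`, and the choice to exclude the target segment itself are all hypothetical and not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_context(query_idx, subtitle_embs, k=2):
    """Return indices of the k segments whose subtitle embeddings are
    most similar to the target subtitle (assumption: the target itself
    is excluded from its own context set)."""
    query = subtitle_embs[query_idx]
    scored = [
        (cosine(query, emb), i)
        for i, emb in enumerate(subtitle_embs)
        if i != query_idx
    ]
    # Highest similarity first; ties broken by segment order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]

# Toy subtitle embeddings (hypothetical values, one vector per segment).
subtitle_embs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2],
]
context = retrieve_context(0, subtitle_embs, k=2)
print(context)
```

The video features of the returned segments would then form the global context set for the target subtitle. A real system would likely replace the linear scan with an approximate nearest-neighbor index.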
Findings
The framework shows significant performance improvement over baseline models on a documentary translation dataset.
Effective utilization of global video context improves translation accuracy.
Region-aware cross-modal attention enhances semantic alignment during translation.
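To make the idea of cross-modal attention concrete, here is a minimal sketch of plain scaled dot-product attention in which a text-side query attends over a set of video feature vectors. This is generic attention, not the paper's region-aware design, whose details are not given in this summary; the names `cross_attention`, `text_query`, and `video_feats`, and all numeric values, are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention weights over the keys and the weighted
    sum of the value vectors."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    out_dim = len(values[0])
    attended = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(out_dim)
    ]
    return weights, attended

# One text-token query attending over three video feature vectors.
text_query = [1.0, 0.0]
video_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, attended = cross_attention(text_query, video_feats, video_feats)
print(weights, attended)
```

Video features that align with the text query receive larger weights, while the remaining features still contribute to the output with smaller weights rather than being discarded.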
Abstract
Video-guided Multimodal Translation (VMT) has advanced significantly in recent years. However, most existing methods rely on locally aligned video segments paired one-to-one with subtitles, limiting their ability to capture global narrative context across multiple segments in long videos. To overcome this limitation, we propose a globally video-guided multimodal translation framework that leverages a pretrained semantic encoder and vector database-based subtitle retrieval to construct a context set of video segments closely related to the target subtitle semantics. An attention mechanism is employed to focus on highly relevant visual content, while preserving the remaining video features to retain broader contextual information. Furthermore, we design a region-aware cross-modal attention mechanism to enhance semantic alignment during translation. Experiments on a large-scale documentary…
