Empowering Video Translation using Multimodal Large Language Models
Bingzheng QU, Kehai Chen, Xuefeng Bai, Min Zhang

TL;DR
This paper reviews how multimodal large language models (MLLMs) enhance video translation through improved understanding, reasoning, and generation, with translation quality competitive with or superior to traditional cascaded pipelines and stronger robustness.
Contribution
It offers the first systematic overview of MLLMs in video translation, organized into a three-role taxonomy covering understanding, speech generation, and visual synthesis.
Findings
MLLM-based systems achieve translation quality competitive with or superior to traditional cascaded pipelines.
They demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios.
The review identifies open challenges and future directions.
Abstract
Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLM-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech, and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality, but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and…
