Empowering Video Translation using Multimodal Large Language Models

Bingzheng QU; Kehai Chen; Xuefeng Bai; Min Zhang

arXiv:2604.11283·cs.CV·April 14, 2026

TL;DR

This paper reviews how multimodal large language models (MLLMs) improve video translation through stronger understanding, reasoning, and generation, reaching competitive or superior translation quality and greater robustness than traditional cascaded pipelines.

Contribution

It offers the first systematic overview of MLLMs in video translation, organized into a three-role taxonomy covering understanding, speech generation, and visual synthesis.

Findings

01. MLLM-based systems achieve competitive or superior translation quality compared with traditional cascaded pipelines.

02. They show stronger robustness in zero-shot settings and multi-speaker scenarios.

03. The review identifies open challenges and future directions.

Abstract

Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLM-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech, and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality, but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and…
