ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning

Yichen Lu; Wei Dai; Jiaen Liu; Ching Wing Kwok; Zongheng Wu; Xudong Xiao; Ao Sun; Sheng Fu; Jianyuan Zhan; Yian Wang; Takatomo Saito; Sicheng Lai

arXiv:2507.07306·cs.AI·July 11, 2025

TL;DR

ViDove is a multimodal translation agent system that uses visual context and memory modules to improve translation quality, especially for complex and long-form content. It reports a 28% improvement in BLEU and a 15% improvement in SubER over previous baselines.

Contribution

The paper introduces ViDove, a novel multimodal translation agent with memory-augmented reasoning, and presents DoveBench, a new benchmark for long-form video translation.

Findings

01. 28% BLEU score improvement over baselines

02. 15% SubER improvement in translation quality

03. Effective use of visual context and memory modules
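
The two headline percentages point in opposite directions at the metric level: BLEU is a score where higher is better, while SubER is an edit-rate metric where lower is better. The short sketch below shows one way such improvements are commonly computed. It assumes the reported figures are relative changes against a baseline, which this page does not state. The helper name `relative_improvement` and all score values are hypothetical, chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def relative_improvement(baseline: float, system: float,
                         higher_is_better: bool = True) -> float:
    """Relative improvement of `system` over `baseline`, in percent.

    For higher-is-better metrics (e.g. BLEU) a gain is system - baseline.
    For lower-is-better metrics (e.g. SubER) a gain is baseline - system.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    delta = system - baseline if higher_is_better else baseline - system
    return 100.0 * delta / baseline


# Hypothetical scores, NOT taken from the paper.
bleu_gain = relative_improvement(baseline=20.0, system=25.6)
suber_gain = relative_improvement(baseline=80.0, system=68.0,
                                  higher_is_better=False)

print(f"BLEU:  {bleu_gain:.1f}% relative improvement")
print(f"SubER: {suber_gain:.1f}% relative improvement")
```

Under this reading, a "15% SubER improvement" means the error rate dropped by 15% of its baseline value, not that the score rose.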

Abstract

LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Subtitles and Audiovisual Media