Dual-branch Prompting for Multimodal Machine Translation

Jie Wang; Zhendong Yang; Liansong Zong; Xiaobo Zhang; Dexian Wang; Ji Zhang

arXiv:2507.17588 · cs.CV · December 5, 2025

Open Access

TL;DR

This paper introduces D2P-MMT, a diffusion-based dual-branch prompting framework that makes multimodal machine translation more robust. At inference it needs only the source text and an image reconstructed by a pre-trained diffusion model, which reduces the influence of visual noise and improves translation accuracy.

Contribution

The paper proposes a novel diffusion-based dual-branch prompting method with distributional alignment for more robust multimodal translation.

Findings

01. Outperforms existing state-of-the-art methods on the Multi30K dataset.

02. Filters visual noise by using reconstructed images.

03. Improves cross-modal interaction and translation quality.

Abstract

Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate…
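
The dual-branch training idea can be illustrated with a small sketch. It is not the authors' implementation: the abstract is cut off before the alignment objective is described, so the KL term, the weight `align_weight`, and every function name below are assumptions made only for illustration. Each branch is represented by the logits it would produce for one target token, one branch conditioned on the authentic image and one on the reconstructed image.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-likelihood of the reference token index."""
    return -math.log(probs[target])


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dual_branch_loss(logits_authentic, logits_reconstructed, target, align_weight=1.0):
    """Hypothetical joint objective for one target token.

    Both branches are trained on the translation loss, and an assumed KL
    term pulls the two output distributions toward each other.
    """
    p_auth = softmax(logits_authentic)
    p_rec = softmax(logits_reconstructed)
    translation = cross_entropy(p_auth, target) + cross_entropy(p_rec, target)
    alignment = kl_divergence(p_auth, p_rec)
    return translation + align_weight * alignment


# Toy logits over a 3-token vocabulary; the reference token is index 2.
loss = dual_branch_loss([1.0, 2.0, 3.0], [0.5, 2.5, 2.0], target=2)
```

When the two branches already agree, the assumed alignment term is zero and only the two translation losses remain, which is the behaviour one would want before dropping the authentic-image branch at inference.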

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems