Dual-branch Prompting for Multimodal Machine Translation
Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang, Ji Zhang

TL;DR
This paper introduces D2P-MMT, a diffusion-based dual-branch prompting framework for multimodal machine translation. It uses reconstructed images to reduce the influence of visual noise, which improves robustness and translation accuracy.
Contribution
The paper proposes a novel diffusion-based dual-branch prompting method with distributional alignment for more robust multimodal translation.
Findings
Outperforms existing state-of-the-art methods on the Multi30K dataset.
Effectively filters visual noise through reconstructed images.
Improves cross-modal interaction and translation quality.
Abstract
Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate…
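The training idea described in the abstract, two branches that share a translation target and are pulled toward similar output distributions, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract is truncated before it specifies the alignment objective, so the symmetric KL term, the `align_weight` parameter, and all function names here are hypothetical choices made for illustration. The logits stand in for the decoder outputs of the authentic-image branch and the reconstructed-image branch.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, targets):
    # Mean negative log-likelihood of the reference token ids.
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.log(picked + 1e-12).mean())

def kl_divergence(p, q):
    # Mean KL(p || q) over token positions.
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1).mean())

def dual_branch_loss(logits_real, logits_recon, targets, align_weight=1.0):
    """Hypothetical joint loss: translation loss on both branches plus an
    assumed symmetric-KL alignment term between their output distributions."""
    p_real = softmax(logits_real)    # branch fed the authentic image
    p_recon = softmax(logits_recon)  # branch fed the reconstructed image
    ce = cross_entropy(p_real, targets) + cross_entropy(p_recon, targets)
    align = 0.5 * (kl_divergence(p_real, p_recon) + kl_divergence(p_recon, p_real))
    return ce + align_weight * align, ce, align
```

When both branches produce the same logits, the alignment term vanishes and the loss reduces to the two translation terms. At inference only the reconstructed-image branch would be needed, which matches the abstract's statement that the model requires only the source text and a reconstructed image.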
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
