VALHALLA: Visual Hallucination for Machine Translation
Yi Li, Rameswar Panda, Yoon Kim, Chun-Fu Chen, Rogerio Feris, David Cox, Nuno Vasconcelos

TL;DR
VALHALLA is a multimodal machine translation approach that hallucinates visual representations from the source text, so translation requires no paired images at inference time, which broadens its real-world applicability.
Contribution
The paper proposes a visual hallucination framework for machine translation that predicts visual features from text, eliminating the need for paired images at inference time.
Findings
Outperforms text-only baselines on multiple datasets
Achieves competitive results compared to multimodal methods with paired images
Demonstrates robustness across diverse language pairs
Abstract
Designing better machine translation systems by considering auxiliary inputs such as images has attracted much attention in recent years. While existing methods show promising performance over the conventional text-only translation systems, they typically require paired text and image as input during inference, which limits their applicability to real-world scenarios. In this paper, we introduce a visual hallucination framework, called VALHALLA, which requires only source sentences at inference time and instead uses hallucinated visual representations for multimodal machine translation. In particular, given a source sentence an autoregressive hallucination transformer is used to predict a discrete visual representation from the input text, and the combined text and hallucinated representations are utilized to obtain the target translation. We train the hallucination transformer jointly…
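The inference path described in the abstract (text in, discrete visual tokens predicted autoregressively, both passed on to translation) can be illustrated with a toy sketch. This is not the authors' implementation: the scorer is a fixed random table standing in for a trained hallucination transformer, decoding is greedy, and the text and visual tokens are combined by simple concatenation, which is an assumption since the abstract does not say how they are combined. All names and sizes (`VISUAL_VOCAB`, `NUM_VISUAL_TOKENS`, `make_scorer`, `hallucinate`, `build_translation_input`) are hypothetical.

```python
import random

TEXT_VOCAB = 16        # hypothetical text vocabulary size
VISUAL_VOCAB = 8       # hypothetical codebook size for discrete visual tokens
NUM_VISUAL_TOKENS = 4  # hypothetical length of the hallucinated sequence


def make_scorer(seed=0):
    """Toy stand-in for a trained hallucination transformer.

    Returns a function that scores a candidate visual token given the
    context so far, using a fixed random lookup table.
    """
    rng = random.Random(seed)
    table = [
        [rng.random() for _ in range(VISUAL_VOCAB)]
        for _ in range(TEXT_VOCAB + VISUAL_VOCAB)
    ]

    def score(context, candidate):
        # Sum the contribution of every context token; a real model
        # would use attention over the context instead.
        return sum(table[tok][candidate] for tok in context)

    return score


def hallucinate(text_ids, score, length=NUM_VISUAL_TOKENS):
    """Greedily decode discrete visual tokens, one at a time, from text."""
    context = list(text_ids)
    visual = []
    for _ in range(length):
        best = max(range(VISUAL_VOCAB), key=lambda c: score(context, c))
        visual.append(best)
        # Visual ids are offset past the text ids so both share one id space.
        context.append(TEXT_VOCAB + best)
    return visual


def build_translation_input(text_ids, visual_ids):
    """Concatenate text and hallucinated visual tokens (assumed fusion)."""
    return list(text_ids) + [TEXT_VOCAB + v for v in visual_ids]


source_ids = [3, 7, 1, 12]  # a made-up tokenized source sentence
scorer = make_scorer(seed=0)
visual_ids = hallucinate(source_ids, scorer)
model_input = build_translation_input(source_ids, visual_ids)
```

Only the source sentence is needed here; no image is read at any point, which is the property the abstract emphasizes. A translation model would then consume `model_input` in place of a text-plus-image pair.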
Taxonomy
Topics: Cell Image Analysis Techniques
