Multimodal Machine Translation through Visuals and Speech
Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, Jörg Tiedemann

TL;DR
This survey reviews multimodal machine translation methods that leverage visual and speech modalities, discussing datasets, evaluation, state-of-the-art approaches, challenges, and future research directions.
Contribution
It provides a comprehensive overview of multimodal translation tasks, datasets, and evaluation campaigns, and highlights open challenges and future directions in the field.
Findings
Summarizes major datasets and evaluation campaigns.
Analyzes state-of-the-art end-to-end and pipeline approaches.
Identifies challenges and future research directions.
Abstract
Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, which exploits the audio modality, and image-guided and video-guided translation, which exploit the visual modality. These tasks are distinguished from their monolingual counterparts (speech recognition, image captioning, and video captioning) by the requirement that models generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas:…
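The abstract contrasts pipeline and end-to-end approaches. For spoken language translation, the structural difference can be sketched as below. This is a minimal illustration, not a method from the survey: the lookup tables (`ASR_TABLE`, `MT_TABLE`, `E2E_TABLE`), the placeholder acoustic tokens (`"a1"`, `"a2"`, ...), and all function names are hypothetical stand-ins for trained models.

```python
from typing import Dict, List

# Hypothetical toy "models": lookup tables stand in for trained systems.
# Audio is represented as a list of placeholder acoustic tokens.
ASR_TABLE: Dict[str, str] = {"a1": "the", "a2": "cat", "a3": "sleeps"}
MT_TABLE: Dict[str, str] = {"the": "die", "cat": "Katze", "sleeps": "schläft"}
# A direct audio-to-target mapping standing in for an end-to-end model.
E2E_TABLE: Dict[str, str] = {"a1": "die", "a2": "Katze", "a3": "schläft"}


def toy_asr(audio: List[str]) -> List[str]:
    """Pipeline stage 1: transcribe audio tokens into source-language words."""
    return [ASR_TABLE.get(tok, "<unk>") for tok in audio]


def toy_mt(words: List[str]) -> List[str]:
    """Pipeline stage 2: translate source-language words into the target language."""
    return [MT_TABLE.get(w, "<unk>") for w in words]


def pipeline_slt(audio: List[str]) -> List[str]:
    """Pipeline (cascaded) SLT: the MT stage only ever sees the ASR transcript."""
    return toy_mt(toy_asr(audio))


def end_to_end_slt(audio: List[str]) -> List[str]:
    """End-to-end SLT: one mapping from audio straight to target-language text."""
    return [E2E_TABLE.get(tok, "<unk>") for tok in audio]


if __name__ == "__main__":
    speech = ["a1", "a2", "a3"]
    print(pipeline_slt(speech))
    print(end_to_end_slt(speech))
```

In the pipeline, the translation stage never sees the audio, so any recognition error made by `toy_asr` is passed to `toy_mt` unchanged; the end-to-end function produces target text in a single step with no intermediate transcript.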
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Subtitles and Audiovisual Media
