Multimodal Machine Translation through Visuals and Speech

Umut Sulubacak; Ozan Caglayan; Stig-Arne Grönroos; Aku Rouhe; Desmond Elliott; Lucia Specia; Jörg Tiedemann

arXiv:1911.12798·cs.CL·December 2, 2019

TL;DR

This survey reviews multimodal machine translation methods that leverage visual and speech modalities, discussing datasets, evaluation, state-of-the-art approaches, challenges, and future research directions.

Contribution

This survey provides a comprehensive overview of multimodal translation tasks, datasets, and evaluation campaigns, and highlights open challenges and future directions in the field.

Findings

01. Summarizes major datasets and evaluation campaigns.

02. Analyzes state-of-the-art end-to-end and pipeline approaches.

03. Identifies challenges and future research directions.

Abstract

Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas:…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Subtitles and Audiovisual Media