How2: A Large-scale Dataset for Multimodal Language Understanding

Ramon Sanabria; Ozan Caglayan; Shruti Palaskar; Desmond Elliott; Loïc Barrault; Lucia Specia; Florian Metze

arXiv:1811.00347·cs.CL·December 10, 2018·153 cites

TL;DR

This paper introduces How2, a large-scale multimodal dataset with videos, subtitles, and translations, along with baseline models for various language understanding tasks to advance research in multimodal NLP.

Contribution

The paper provides a new extensive multimodal dataset and baseline models for multiple language understanding tasks, facilitating future research in multimodal NLP.

Findings

01. Dataset enables research on multimodal language tasks

02. Baseline models demonstrate feasibility of multimodal processing

03. Data and code availability promotes further advancements

Abstract

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making available data and code for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, to obtain a deeper understanding of multimodality in language processing.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems