How2: A Large-scale Dataset for Multimodal Language Understanding
Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze

TL;DR
This paper introduces How2, a large-scale multimodal dataset with videos, subtitles, and translations, along with baseline models for various language understanding tasks to advance research in multimodal NLP.
Contribution
The paper provides a new extensive multimodal dataset and baseline models for multiple language understanding tasks, facilitating future research in multimodal NLP.
Findings
Dataset enables research on multimodal language tasks
Baseline models demonstrate feasibility of multimodal processing
Data and code availability promotes further advancements
Abstract
In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making available data and code for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, to obtain a deeper understanding of multimodality in language processing.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
