MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration

Thomas Hayes, Songyang Zhang, Xi Yin, Guan Pang, Sasha Sheng, Harry Yang, Songwei Ge, Qiyuan Hu, and Devi Parikh

arXiv:2204.08058·cs.CV·April 29, 2022

TL;DR

MUGEN is a large-scale, richly annotated video-audio-text dataset from a modified open-source game, designed to advance multimodal understanding and generation research through diverse, challenge-oriented data.

Contribution

This paper introduces MUGEN, a novel multimodal dataset with extensive annotations from a modified game environment, enabling progress in video-audio-text understanding and generation.

Findings

1. Sampled 375K video clips with human descriptions
2. Extracted automatic semantic maps and textual annotations
3. Benchmarked approaches on retrieval and generation tasks

Abstract

Multimodal video-audio-text understanding and generation can benefit from datasets that are narrow but rich. The narrowness allows bite-sized challenges that the research community can make progress on. The richness ensures we are making progress along the core challenges. To this end, we present a large-scale video-audio-text dataset MUGEN, collected using the open-sourced platform game CoinRun [11]. We made substantial modifications to make the game richer by introducing audio and enabling new interactions. We trained RL agents with different objectives to navigate the game and interact with 13 objects and characters. This allows us to automatically extract a large collection of diverse videos and associated audio. We sample 375K video clips (3.2s each) and collect text descriptions from human annotators. Each video has additional annotations that are extracted automatically from the…

Taxonomy

Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Natural Language Processing Techniques