MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration
Thomas Hayes, Songyang Zhang, Xi Yin, Guan Pang, Sasha Sheng, Harry Yang, Songwei Ge, Qiyuan Hu, and Devi Parikh

TL;DR
MUGEN is a large-scale, richly annotated video-audio-text dataset from a modified open-source game, designed to advance multimodal understanding and generation research through diverse, challenge-oriented data.
Contribution
This paper introduces MUGEN, a novel multimodal dataset with extensive annotations from a modified game environment, enabling progress in video-audio-text understanding and generation.
Findings
Sampled 375K video clips with human descriptions
Extracted automatic semantic maps and textual annotations
Benchmarked approaches on retrieval and generation tasks
Abstract
Multimodal video-audio-text understanding and generation can benefit from datasets that are narrow but rich. The narrowness allows bite-sized challenges that the research community can make progress on. The richness ensures we are making progress along the core challenges. To this end, we present a large-scale video-audio-text dataset MUGEN, collected using the open-sourced platform game CoinRun [11]. We made substantial modifications to make the game richer by introducing audio and enabling new interactions. We trained RL agents with different objectives to navigate the game and interact with 13 objects and characters. This allows us to automatically extract a large collection of diverse videos and associated audio. We sample 375K video clips (3.2s each) and collect text descriptions from human annotators. Each video has additional annotations that are extracted automatically from the…
Taxonomy
Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Natural Language Processing Techniques
