MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation

Ruibo Fu; Shuchen Shi; Hongming Guo; Tao Wang; Chunyu Qiang; Zhengqi Wen; Jianhua Tao; Xin Qi; Yi Lu; Xiaopeng Wang; Zhiyong Wang; Yukun Liu; Xuefei Liu; Shuai Zhang; Guanjun Li

arXiv:2406.10591 · eess.AS · June 18, 2024

TL;DR

This paper introduces MINT, a multi-modal dataset, and a framework for AI-generated foley audio dubbing that improves scene understanding and content alignment; the authors report that it surpasses existing models in realism and accuracy.

Contribution

The paper presents the MINT dataset and the Foley Audio Content Planning, Generation, and Alignment (CPGA) framework, which uses large language models and reinforcement learning to improve foley audio dubbing quality.

Findings

01 Outperforms existing multimodal models such as LLaVA and DeepSeek-VL.

02 Enhances scene matching and content correlation in foley audio generation.

03 Achieves more realistic and better-aligned audio outputs in multimedia content.

Abstract

Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio technology, which relies on detailed and acoustically relevant textual descriptions, falls short in practical video dubbing applications. Existing datasets such as AudioSet, AudioCaps, Clotho, Sound-of-Story, and WavCaps do not fully meet the requirements of real-world foley audio dubbing tasks. To address this, we introduce the Multi-modal Image and Narrative Text Dubbing Dataset (MINT), designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent video dubbing. Besides, to address…

Code & Models

Repositories

borisfrb/MINT (official)

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Subtitles and Audiovisual Media

Methods: Attention Is All You Need · Cosine Annealing · Layer Normalization · Byte Pair Encoding · Attention Dropout · Weight Decay · Dropout · Adam · Linear Warmup With Cosine Annealing