MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video

Kazuya Tateishi; Akira Takahashi; Atsuo Hiroe; Hirofumi Takeda; Shusuke Takahashi; Yuki Mitsufuji

arXiv:2605.00495·cs.SD·May 4, 2026

TL;DR

MMAudio-LABEL is a framework that jointly generates audio and frame-aligned sound event labels from silent videos, improving onset detection and material classification accuracy over a post-hoc pipeline that runs sound event detection on the generated audio.

Contribution

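It introduces a joint audio generation and event prediction model that outputs frame-aligned sound event labels (type and timing) alongside the generated audio, improving the interpretability and labeling accuracy of audio synthesized for silent video.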

Findings

01 Onset detection accuracy improved from 46.7% to 75.0%.

02 Material classification accuracy increased from 40.6% to 61.0%.

03 Joint learning outperforms baseline post-hoc detection methods.

Abstract

Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach involves applying a standard sound event detection model to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware audio generation framework built on a foundational audio generation model as its backbone that jointly generates audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection…
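To make "frame-aligned sound event predictions" concrete, the following is a minimal sketch of how per-frame outputs could be turned into timed event labels. It is not the paper's decoding procedure, which the abstract does not describe. The function name `decode_events`, the 0.5 threshold, the local-peak rule, the 10 Hz frame rate, and the two material names are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def decode_events(
    onset_probs: Sequence[float],
    class_probs: Sequence[Sequence[float]],
    class_names: Sequence[str],
    frame_rate_hz: float,
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[Tuple[float, str]]:
    """Convert frame-aligned predictions into (time_in_seconds, label) events.

    A frame counts as an onset when its probability reaches the threshold
    and is a local peak (>= left neighbor and > right neighbor). The label
    is the highest-scoring class at that frame.
    """
    events = []
    n = len(onset_probs)
    for t in range(n):
        p = onset_probs[t]
        if p < threshold:
            continue
        left = onset_probs[t - 1] if t > 0 else float("-inf")
        right = onset_probs[t + 1] if t + 1 < n else float("-inf")
        if p >= left and p > right:
            row = class_probs[t]
            best = max(range(len(row)), key=row.__getitem__)
            events.append((t / frame_rate_hz, class_names[best]))
    return events


# Hypothetical per-frame outputs for 7 frames at an assumed 10 Hz frame rate.
onset_probs = [0.1, 0.8, 0.3, 0.2, 0.9, 0.95, 0.4]
class_names = ["wood", "metal"]  # illustrative subset, not the 17-class list
class_probs = [
    [0.5, 0.5], [0.7, 0.3], [0.6, 0.4], [0.5, 0.5],
    [0.3, 0.7], [0.2, 0.8], [0.4, 0.6],
]

print(decode_events(onset_probs, class_probs, class_names, frame_rate_hz=10.0))
# [(0.1, 'wood'), (0.5, 'metal')]
```

In a post-hoc pipeline, the same kind of decoding would run on the output of a separate detector applied to the generated audio, so errors in the audio carry over into the labels. The joint approach described in the abstract instead produces the frame-level predictions together with the audio.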
