FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion

Shunian Chen; Xinyuan Xie; Zheshu Chen; Liyan Zhao; Owen Lee; Zhan Su; Qilin Sun; Benyou Wang

arXiv:2506.01111 · cs.SD · June 3, 2025

TL;DR

This paper introduces a novel two-stage pipeline for fine-grained, context-aware audio captioning that leverages multimodal cues and large language models, supported by a new large-scale dataset called FusionAudio-1.2M.

Contribution

The paper presents a scalable method for detailed audio captioning, introduces the FusionAudio dataset of 1.2 million captions, and develops audio models with improved audio-text alignment.

Findings

01 Enhanced audio captioning accuracy with multimodal fusion

02 FusionAudio dataset enables better training of audio models

03 Improved audio-text alignment using a CLAP-based encoder

Abstract

High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption…
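
The two-stage flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ContextualCues`, `extract_cues`, `build_prompt`, and `caption_clip` are hypothetical, the stage-1 extractors and the LLM are passed in as plain callables rather than real pretrained models, and the prompt wording is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical type: a stage-1 extractor maps a clip path to a text description.
Extractor = Callable[[str], str]

# The four cue types named in the abstract.
CUE_NAMES = ("speech", "music", "sound", "visual")


@dataclass
class ContextualCues:
    """Text cues gathered in stage 1 (field names are illustrative)."""
    speech: str
    music: str
    sound: str
    visual: str


def extract_cues(clip_path: str, extractors: Dict[str, Extractor]) -> ContextualCues:
    """Stage 1: run one specialized extractor per cue type on the clip."""
    return ContextualCues(**{name: extractors[name](clip_path) for name in CUE_NAMES})


def build_prompt(cues: ContextualCues) -> str:
    """Assemble the cues into one prompt (the wording is an assumption)."""
    return (
        "Write a detailed, context-aware caption of the audio.\n"
        f"Speech: {cues.speech}\n"
        f"Music: {cues.music}\n"
        f"General sounds: {cues.sound}\n"
        f"Visual context: {cues.visual}"
    )


def caption_clip(
    clip_path: str,
    extractors: Dict[str, Extractor],
    llm: Callable[[str], str],
) -> str:
    """Stage 2: let an LLM (any text-to-text callable) fuse the cues into a caption."""
    return llm(build_prompt(extract_cues(clip_path, extractors)))
```

In practice, each entry in `extractors` would wrap a pretrained model (for example a speech recognizer or a video captioner), and `llm` would wrap a call to a language model. Keeping both as injected callables lets the fusion logic be tested with simple stubs.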

Code & Models

Repositories

satsuki2486441738/fusionaudio
jax · Official

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Speech and Audio Processing