ASCMamba: Multimodal Time-Frequency Mamba for Acoustic Scene Classification
Bochao Sun, Dong Wang, ZhanLong Yang, Jun Yang, Han Yin

TL;DR
This paper introduces ASCMamba, a multimodal neural network that combines audio and textual data for acoustic scene classification, outperforming all participating teams in the APSIPA ASC 2025 Grand Challenge.
Contribution
We propose a novel multimodal network architecture, ASCMamba, integrating hierarchical spectral features and long-range dependencies for enhanced acoustic scene understanding.
Findings
Outperforms all participating teams in the challenge
Achieves 6.2% improvement over baseline
Demonstrates effectiveness of multimodal integration
Abstract
Acoustic Scene Classification (ASC) is a fundamental problem in computational audition that seeks to classify environments based on their distinctive acoustic features. The APSIPA ASC 2025 Grand Challenge introduces a multimodal ASC task. Unlike traditional ASC systems, which rely solely on audio inputs, this challenge provides additional textual information as input, including the location where the audio was recorded and the time of recording. In this paper, we present our proposed system for the ASC task in the APSIPA ASC 2025 Grand Challenge. Specifically, we propose a multimodal network, ASCMamba, which integrates audio and textual information for fine-grained acoustic scene understanding and effective multimodal ASC. The proposed ASCMamba employs a DenseEncoder to extract hierarchical spectral features from spectrograms, followed by a dual-path…
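To make the idea of combining audio with recording location and time concrete, here is a minimal sketch of one simple way such metadata could be attached to an audio embedding. The abstract shown here does not say how ASCMamba encodes or fuses its textual inputs, so everything below is an assumption for illustration only: the `LOCATIONS` vocabulary, the function names, the cyclic time encoding, and fusion by concatenation are all hypothetical and are not the authors' method.

```python
import math

# Hypothetical location vocabulary; the challenge's real metadata values
# are not given in the abstract.
LOCATIONS = ["city_a", "city_b", "city_c"]


def encode_time(hour: float) -> list[float]:
    """Map an hour of day to a point on the unit circle,
    so that 23:00 and 01:00 end up close to each other."""
    angle = 2.0 * math.pi * (hour % 24.0) / 24.0
    return [math.sin(angle), math.cos(angle)]


def encode_location(name: str) -> list[float]:
    """One-hot encode a location; unknown names map to all zeros."""
    return [1.0 if name == loc else 0.0 for loc in LOCATIONS]


def fuse(audio_embedding: list[float], location: str, hour: float) -> list[float]:
    """Concatenate an audio embedding with the encoded metadata.
    This is plain concatenation, assumed here for simplicity."""
    return list(audio_embedding) + encode_location(location) + encode_time(hour)
```

A call such as `fuse([0.1, 0.2], "city_b", 6.0)` yields a vector whose length is the audio embedding size plus `len(LOCATIONS)` plus 2, which a downstream classifier could consume. A learned text encoder would be a natural alternative to the hand-built features used in this sketch.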
