Multimodal Urban Sound Tagging with Spatiotemporal Context
Jisheng Bai, Jianfeng Chen, Mou Wang

TL;DR
This paper introduces a multimodal urban sound tagging system that combines audio features with spatiotemporal context (when and where a recording was made) to improve the tagging of urban noise sources.
Contribution
The study presents a novel multimodal approach combining audio and spatiotemporal data, with a data filtering technique, to enhance urban sound tagging performance.
Findings
Effective integration of spatiotemporal context improves sound classification accuracy.
The proposed method outperforms previous approaches on the DCASE2020 UST challenge.
Data filtering makes multimodal learning more effective.
Abstract
Noise pollution significantly affects our daily life and urban development. Urban Sound Tagging (UST), which aims to analyze and monitor urban noise pollution, has attracted much attention recently. One weakness of previous UST studies is that the spatial and temporal context of sound signals, which contains complementary information about when and where the audio data was recorded, has not been investigated. To address this problem, we propose a multimodal UST system that jointly mines the audio and spatiotemporal context. To incorporate the characteristics of different acoustic features, two sets of four spectrograms are first extracted as the inputs of residual neural networks. Then, the spatiotemporal context is encoded and combined with acoustic features to explore the efficiency of multimodal learning for discriminating sound signals. Moreover, a…
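The abstract says the spatiotemporal context is encoded and combined with acoustic features, but it does not specify how. The following is a minimal sketch of one plausible reading, assuming one-hot encoding of hour of day, day of week, and a sensor ID, followed by simple concatenation with an audio embedding. The function names, the choice of context fields, the embedding size of 128, and the sensor count of 10 are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def one_hot(index, size):
    """Return a float32 vector of length `size` with a 1.0 at `index`."""
    vec = np.zeros(size, dtype=np.float32)
    vec[index] = 1.0
    return vec


def encode_context(hour, weekday, sensor_id, num_sensors):
    """Encode when and where a clip was recorded (hypothetical scheme).

    hour:      0-23, hour of day
    weekday:   0-6, day of week
    sensor_id: 0..num_sensors-1, stands in for the recording location
    """
    return np.concatenate([
        one_hot(hour, 24),
        one_hot(weekday, 7),
        one_hot(sensor_id, num_sensors),
    ])


def fuse(audio_embedding, context_vector):
    """Combine the two modalities by concatenation (assumed fusion strategy)."""
    return np.concatenate([audio_embedding, context_vector])


# Placeholder for the output of an audio network; in the paper, residual
# networks operating on spectrograms would produce the acoustic features.
audio_embedding = np.ones(128, dtype=np.float32)

context = encode_context(hour=14, weekday=2, sensor_id=5, num_sensors=10)
fused = fuse(audio_embedding, context)
```

The fused vector would then feed a classification head that outputs one score per sound tag. Concatenation is only the simplest option; the authors' system may encode or fuse the context differently.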
Taxonomy
Topics: Noise Effects and Management · Music and Audio Processing · Speech and Audio Processing
