Attentional Graph Convolutional Network for Structure-aware Audio-Visual Scene Classification
Liguang Zhou, Yuhongze Zhou, Xiaonan Qi, Junjie Hu, Tin Lun Lam, Yangsheng Xu

TL;DR
This paper introduces an end-to-end attentional graph convolutional network (AGCN) that captures structure-aware features for audio-visual scene classification, emphasizing salient and semantic regions in both modalities.
Contribution
The paper proposes a novel AGCN framework that constructs and utilizes multiple graphs to represent salient and contextual information in audio-visual data for improved scene recognition.
Findings
Achieved promising results on multiple scene recognition datasets.
Effectively visualized graphs to highlight salient regions.
Demonstrated the importance of structure-aware features in audio-visual understanding.
Abstract
Audio-visual scene understanding is a challenging problem due to the unstructured spatial-temporal relations that exist in the audio signals and the spatial layouts of different objects and various texture patterns in the visual images. Recently, many studies have focused on abstracting features from convolutional neural networks, while the learning of explicit semantically relevant frames of sound signals and visual images has been overlooked. To this end, we present an end-to-end framework, namely attentional graph convolutional network (AGCN), for structure-aware audio-visual scene representation. First, the spectrogram of the sound and the input image are processed by a backbone network for feature extraction. Then, to build multi-scale hierarchical information of input features, we utilize an attention fusion mechanism to aggregate features from multiple layers of the backbone network. Notably,…
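As a rough illustration of the attention fusion step mentioned in the abstract, the sketch below aggregates feature maps from several backbone layers into one vector using softmax attention weights. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the global average pooling, the per-layer projection matrices (`proj_mats`), the scoring vector (`score_vec`), and the layer shapes are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(layer_feats, proj_mats, score_vec):
    """Fuse multi-layer features with softmax attention (illustrative only).

    layer_feats: list of arrays shaped (C_i, H_i, W_i), one per backbone layer.
    proj_mats:   list of arrays shaped (C_i, D), hypothetical learned projections.
    score_vec:   array shaped (D,), hypothetical learned scoring vector.
    """
    # Assumption: collapse each feature map to a channel vector by average pooling.
    pooled = [f.mean(axis=(1, 2)) for f in layer_feats]
    # Project every layer to a shared dimension D and stack to (L, D).
    projected = np.stack([p @ W for p, W in zip(pooled, proj_mats)])
    # One scalar score per layer, normalized into attention weights.
    scores = np.tanh(projected) @ score_vec
    weights = softmax(scores)
    # Weighted sum over layers gives the fused (D,) representation.
    fused = weights @ projected
    return fused, weights

rng = np.random.default_rng(0)
# Hypothetical feature maps from three backbone stages.
layer_feats = [rng.standard_normal((c, s, s)) for c, s in [(64, 16), (128, 8), (256, 4)]]
D = 32
proj_mats = [rng.standard_normal((f.shape[0], D)) * 0.1 for f in layer_feats]
score_vec = rng.standard_normal(D)

fused, weights = attention_fuse(layer_feats, proj_mats, score_vec)
print(fused.shape, weights.shape)
```

In a trained model the projections and scoring vector would be learned jointly with the backbone; here they are random placeholders so that the data flow can be run end to end.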
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing
Methods: Adaptive Graph Convolutional Neural Networks
