Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition
Di Hu, Xuhong Li, Lichao Mou, Pu Jin, Dong Chen, Liping Jing, Xiaoxiang Zhu, Dejing Dou

TL;DR
This paper introduces a novel audiovisual approach to aerial scene recognition, leveraging sound event knowledge to enhance visual classification accuracy, supported by a new dataset and three transfer learning methods.
Contribution
It proposes a new multimodal framework that transfers sound event knowledge to improve aerial scene recognition, along with a new dataset for evaluation.
Findings
Audio information improves scene recognition accuracy
Three transfer methods demonstrate effective knowledge transfer
New dataset ADVANCE supports multimodal aerial scene analysis
Abstract
Aerial scene recognition is a fundamental task in remote sensing and has recently received increased interest. While the visual information from overhead images, combined with powerful models and efficient algorithms, yields considerable performance on scene recognition, it still suffers from variation in ground objects, lighting conditions, etc. Inspired by the multi-channel perception theory in cognitive science, in this paper we explore a novel audiovisual aerial scene recognition task that uses both images and sounds as input to improve recognition performance. Based on the observation that some specific sound events are more likely to be heard at a given geographic location, we propose to exploit knowledge from sound events to improve the performance of aerial scene recognition. For this purpose, we have constructed a new dataset named AuDio Visual…
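The idea of using both images and sounds as input can be illustrated with a small sketch. This summary does not describe the paper's three transfer methods, so the code below is not one of them: it is a generic late-fusion baseline, assuming each modality already produces per-class scores. The function names, scene labels, scores, and the `audio_weight` parameter are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(image_logits, audio_logits, audio_weight=0.5):
    """Weighted average of per-modality class probabilities.

    Hypothetical helper for illustration only; not the paper's method.
    """
    if len(image_logits) != len(audio_logits):
        raise ValueError("both modalities must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    p_img = softmax(image_logits)
    p_aud = softmax(audio_logits)
    return [
        (1.0 - audio_weight) * pi + audio_weight * pa
        for pi, pa in zip(p_img, p_aud)
    ]


# Made-up labels and scores: the image alone barely separates the first
# two classes, while the audio clearly favors the first one.
SCENES = ["beach", "forest", "train station"]
image_logits = [2.0, 1.9, 0.1]
audio_logits = [3.0, 0.2, 0.1]

fused = late_fusion(image_logits, audio_logits)
predicted = SCENES[max(range(len(fused)), key=fused.__getitem__)]
```

Setting `audio_weight=0.0` reduces the sketch to an image-only classifier, which gives a simple way to check whether the audio scores change a prediction.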
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Advanced Image and Video Retrieval Techniques
