Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery
Pengyu Chen, Xiao Huang, Teng Fei, Sicheng Wang

TL;DR
This paper examines how well urban sounds correspond to visual imagery, using multimodal models on data from three major cities: London, New York, and Tokyo. It finds that street view embeddings align more closely with environmental sounds than segmentation outputs do.
Contribution
The paper compares street-level and remote sensing imagery for sound-vision alignment and shows that embedding-based models are effective for urban acoustic-visual analysis.
Findings
Street view embeddings show stronger sound alignment than segmentation outputs.
Remote sensing segmentation better interprets ecological categories.
Embedding models offer superior semantic alignment for urban sound analysis.
Abstract
Environmental soundscapes convey substantial ecological and social information regarding urban environments; however, their potential remains largely untapped in large-scale geographic analysis. In this study, we investigate the extent to which urban sounds correspond with visual scenes by comparing various visual representation strategies in capturing acoustic semantics. We employ a multimodal approach that integrates geo-referenced sound recordings with both street-level and remote sensing imagery across three major global cities: London, New York, and Tokyo. Utilizing the AST model for audio, along with CLIP and RemoteCLIP for imagery, as well as CLIPSeg and Seg-Earth OV for semantic segmentation, we extract embeddings and class-level features to evaluate cross-modal similarity. The results indicate that street view embeddings demonstrate stronger alignment with environmental sounds…
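As an illustration of what "evaluating cross-modal similarity" can look like in practice, here is a minimal sketch. It is not the authors' pipeline. It assumes that audio and image embeddings have already been extracted for the same set of locations, and it compares them with a representational-similarity score, a technique chosen here because it does not require the two embedding spaces to share a dimensionality. All function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pairwise_similarities(embeddings):
    """Cosine similarity for every unordered pair of locations within one modality."""
    return [
        cosine(embeddings[i], embeddings[j])
        for i, j in combinations(range(len(embeddings)), 2)
    ]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def alignment_score(audio_embeddings, image_embeddings):
    """Correlate the within-modality similarity structure of audio and imagery.

    Row i of each input is assumed to describe the same location.
    A score near 1 means locations that sound alike also look alike.
    """
    if len(audio_embeddings) != len(image_embeddings):
        raise ValueError("Both modalities must cover the same locations.")
    return pearson(
        pairwise_similarities(audio_embeddings),
        pairwise_similarities(image_embeddings),
    )


# Toy data (made up): four locations, 3-dim audio and 2-dim image vectors.
audio = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
image = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
score = alignment_score(audio, image)
```

In a real setting, the rows of `audio` might come from an audio model such as AST and the rows of `image` from CLIP or RemoteCLIP, with one score computed per imagery source so that street-level and aerial views can be compared on the same recordings.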
