City-Identification of Flickr Videos Using Semantic Acoustic Features
Benjamin Elizalde, Guan-Lin Chao, Ming Zeng, Ian Lane

TL;DR
This paper introduces a novel audio-only method for city-identification of videos using semantic acoustic features, demonstrating that urban sounds can effectively indicate city location and improve identification accuracy.
Contribution
The paper presents a new semantic acoustic feature extraction method for city-identification, showing improved performance without relying on visual or metadata modalities.
Findings
Improved state-of-the-art accuracy in city-identification
Semantic acoustic features correlate strongly with city location
Urban sound taxonomy enhances identification performance
Abstract
City-identification of videos aims to determine the likelihood of a video belonging to each of a set of cities. In this paper, we present an approach that uses only audio, without any additional modality such as images, user tags, or geo-tags. In this manner, we show to what extent the city location of videos correlates with their acoustic information. Success in this task suggests that audio can complement the other modalities. In particular, we present a method to compute and use semantic acoustic features to perform city-identification; these features also provide semantic evidence for the identification. The semantic evidence is given by a taxonomy of urban sounds and expresses the potential presence of these sounds in the city soundtracks. We used the MediaEval Placing Task set, which contains Flickr videos labeled by city. In addition, we used the UrbanSound8K set containing…
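The idea of a semantic acoustic feature can be illustrated with a minimal sketch. It assumes that some sound-event classifier (not shown, and not specified in the excerpt above) has already produced one score per urban sound class for each short audio segment of a video. The function names, the averaging step, and the nearest-centroid decision rule are all illustrative assumptions, not the authors' pipeline; only the ten class names are taken from the UrbanSound8K dataset.

```python
from statistics import fmean

# The ten sound classes of UrbanSound8K, used here as the semantic vocabulary.
URBAN_SOUND_CLASSES = [
    "air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
    "engine_idling", "gun_shot", "jackhammer", "siren", "street_music",
]


def semantic_feature(segment_scores):
    """Average per-segment class scores into one video-level vector.

    segment_scores: list of equal-length score lists, one per audio segment.
    Each position corresponds to one sound class, so the result can be read
    as "how strongly each urban sound appears to be present in this video".
    """
    if not segment_scores:
        raise ValueError("need at least one segment")
    return [fmean(column) for column in zip(*segment_scores)]


def city_centroids(features_by_city):
    """Mean semantic feature per city (hypothetical stand-in for training)."""
    return {city: semantic_feature(feats) for city, feats in features_by_city.items()}


def identify_city(feature, centroids):
    """Return the city whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda city: sq_dist(feature, centroids[city]))
```

Because every dimension of the feature is tied to a named sound class, a prediction can be inspected by checking which class scores drive it, which is the sense in which such features carry semantic evidence.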
