Learning Street View Representations with Spatiotemporal Contrast
Yong Li, Yingjing Huang, Gengchen Mai, Fan Zhang

TL;DR
This paper introduces a self-supervised contrastive learning framework that leverages spatiotemporal street view imagery to learn robust urban environment representations, improving performance in various city-related tasks.
Contribution
It proposes a novel spatiotemporal contrastive learning method that captures dynamic and built environment features from street view images, advancing urban visual representation learning.
Findings
Outperforms traditional supervised and unsupervised methods in visual place recognition.
Enhances socioeconomic estimation accuracy using learned representations.
Reveals different behaviors of representations across downstream tasks.
Abstract
Street view imagery is extensively utilized in representation learning for urban visual environments, supporting various sustainable development tasks such as environmental perception and socio-economic assessment. However, it is challenging for existing image representations to specifically encode the dynamic urban environment (such as pedestrians, vehicles, and vegetation), the built environment (including buildings, roads, and urban infrastructure), and the environmental ambiance (such as the cultural and socioeconomic atmosphere) depicted in street view imagery to address downstream tasks related to the city. In this work, we propose an innovative self-supervised learning framework that leverages temporal and spatial attributes of street view imagery to learn image representations of the dynamic urban environment for diverse downstream tasks. By employing street view images captured…
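To make the idea of spatiotemporal contrast concrete, here is a minimal sketch of a contrastive objective over street view pairs. It is not the authors' implementation: the abstract is truncated before it describes how pairs are built, so the pairing rule (same location, different capture year), the record fields `loc`, `year`, and `emb`, and the function names `temporal_pairs` and `info_nce` are all assumptions made for illustration. The loss is the standard InfoNCE form, which the paper may or may not use.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def temporal_pairs(records):
    """Hypothetical pairing rule: same location, consecutive distinct capture years.

    Each record is a dict with assumed keys "loc", "year", and "emb".
    """
    by_loc = {}
    for rec in records:
        by_loc.setdefault(rec["loc"], []).append(rec)
    pairs = []
    for recs in by_loc.values():
        recs = sorted(recs, key=lambda r: r["year"])
        for earlier, later in zip(recs, recs[1:]):
            if earlier["year"] != later["year"]:
                pairs.append((earlier["emb"], later["emb"]))
    return pairs


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; every other positives[j]
    in the batch serves as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


# Toy embeddings standing in for encoder outputs (assumed values).
records = [
    {"loc": "a", "year": 2015, "emb": [1.0, 0.0]},
    {"loc": "a", "year": 2019, "emb": [0.9, 0.1]},
    {"loc": "b", "year": 2016, "emb": [0.0, 1.0]},
    {"loc": "b", "year": 2020, "emb": [0.1, 0.9]},
]
pairs = temporal_pairs(records)
loss = info_nce([a for a, _ in pairs], [p for _, p in pairs])
```

Pulling together images of one location taken years apart would encourage the encoder to ignore transient content such as vehicles and pedestrians. A spatial variant could pair nearby locations instead, under the same loss.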
Taxonomy
Methods: Contrastive Learning
