Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models (Vision Paper)
Bin Han, Yiwei Yang, Anat Caspi, Bill Howe

TL;DR
This paper explores using vision-language models with segmentation prompting to automatically annotate diverse urban features from satellite images, aiming to reduce manual effort and improve urban infrastructure data quality.
Contribution
It introduces a novel prompting strategy for vision-language models to better identify esoteric built environment features in satellite imagery.
Findings
Direct zero-shot prompting, without pre-segmentation, fails to annotate urban features effectively.
Pre-segmentation prompting achieves an intersection-over-union (IoU) score of up to 40%.
Results suggest potential for scalable automatic urban environment annotation.
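The intersection-over-union figure in the findings above measures how much a predicted annotation overlaps the ground-truth annotation, relative to their combined area. The following is a minimal sketch of that metric, not the authors' evaluation code. It assumes masks are given as same-shaped nested lists of 0/1 values, and the function name `mask_iou` and the empty-mask convention are illustrative assumptions.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Return the intersection-over-union of two same-shaped binary masks.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0  # pixels labeled 1 in both masks
    union = 0         # pixels labeled 1 in at least one mask
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1

    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Toy example: 2 overlapping pixels out of 4 labeled pixels in total.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
print(mask_iou(pred, truth))  # 0.5
```

Under this definition, a score of 40% means the overlapping region covers 40% of the area marked by either the prediction or the ground truth.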
Abstract
Equitable urban transportation applications require high-fidelity digital representations of the built environment: not just streets and sidewalks, but bike lanes, marked and unmarked crossings, curb ramps and cuts, obstructions, traffic signals, signage, street markings, potholes, and more. Direct inspections and manual annotations are prohibitively expensive at scale. Conventional machine learning methods require substantial annotated training data for adequate performance. In this paper, we consider vision language models as a mechanism for annotating diverse urban features from satellite images, reducing the dependence on human annotation to produce large training sets. While these models have achieved impressive results in describing common objects in images captured from a human perspective, their training sets are less likely to include strong signals for esoteric features in the…
Taxonomy
Topics: 3D Surveying and Cultural Heritage
