Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery
Siyuan Yao, Siavash Ghorbany, Kuangshi Ai, Arnav Cherukuthota, Meghan Forstchen, Alexis Korotasz, Matthew Sisk, Ming Hu, Chaoli Wang

TL;DR
This paper introduces a scalable framework that uses multimodal large language models and Google Street View imagery to automatically assess building conditions and housing attributes across the United States, with strong agreement with human ratings and fast inference.
Contribution
The authors fine-tune Gemma 3 27B for building condition assessment, distill it into smaller models for faster inference, and develop a visualization tool, advancing large-scale, automated evaluation of the built environment.
Findings
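Fine-tuned Gemma 3 27B aligns with the human mean opinion score (MOS) more closely than individual raters do, as measured by SRCC and PLCC.
Knowledge distillation into Gemma 3 4B gives comparable performance with a 3x speedup; distillation into EfficientNetV2-M and SwinV2-B gives close performance with a 30x speedup.
Fine-tuning requires only a modest human-labeled dataset.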
Abstract
We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes…
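The abstract reports alignment with MOS using SRCC (Spearman rank correlation) and PLCC (Pearson linear correlation). As a minimal sketch of how these two metrics can be computed, the Python below uses only the standard library. The ratings, the model scores, and the function names are hypothetical illustrations, not the paper's data or code, and the paper may compute MOS or handle ties differently.

```python
from statistics import mean


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: three raters scoring five images on a 1-5 scale.
ratings = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 4],
    [1, 3, 3, 4, 5],
]
# MOS is taken here as the per-image mean over raters (an assumption).
mos = [mean(column) for column in zip(*ratings)]

# Hypothetical model predictions for the same five images.
model_scores = [1.2, 2.6, 2.9, 4.1, 4.8]

plcc = pearson(model_scores, mos)
srcc = spearman(model_scores, mos)
print(f"PLCC={plcc:.3f}  SRCC={srcc:.3f}")
```

The same two functions can be applied to a single rater's row of `ratings` against `mos` to obtain the per-rater baseline that the model is compared with.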
