VectorLLM: Human-like Extraction of Structured Building Contours via Multimodal LLMs

Tao Zhang; Shiqing Wei; Shihao Chen; Wenling Yu; Muying Luo; Shunping Ji

arXiv:2507.04664·cs.CV·July 8, 2025

TL;DR

VectorLLM introduces a multimodal large language model that directly regresses building contours from remote sensing images, outperforming previous methods and generalizing zero-shot to object types not seen in training.

Contribution

The paper presents the first LLM-based approach for direct corner-point regression of building contours, integrating vision and language models for improved accuracy and generalization.

Findings

01 Outperforms previous SOTA by 5.6-13.6 AP across datasets

02 Achieves strong zero-shot performance on unseen objects

03 Establishes a new paradigm for vector extraction in remote sensing

Abstract

Automatically extracting vectorized building contours from remote sensing imagery is crucial for urban planning, population estimation, and disaster assessment. Current state-of-the-art methods rely on complex multi-stage pipelines involving pixel segmentation, vectorization, and polygon refinement, which limits their scalability and real-world applicability. Inspired by the remarkable reasoning capabilities of Large Language Models (LLMs), we introduce VectorLLM, the first Multi-modal Large Language Model (MLLM) designed for regular building contour extraction from remote sensing images. Unlike existing approaches, VectorLLM performs corner-point by corner-point regression of building contours directly, mimicking human annotators' labeling process. Our architecture consists of a vision foundation backbone, an MLP connector, and an LLM, enhanced with learnable position embeddings to…
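
The corner-point by corner-point idea described in the abstract can be illustrated with a small sketch of how an ordered contour might be serialized into a sequence for an autoregressive model and then recovered. This is only an illustration under stated assumptions: the abstract does not say how VectorLLM represents coordinates, so the quantized coordinate tokens, the `NUM_BINS` resolution, the `<x_i>`/`<y_i>` token names, and the `<eos>` marker are all hypothetical and may differ from the paper's actual scheme.

```python
from typing import List, Tuple

Point = Tuple[float, float]

NUM_BINS = 1000   # hypothetical quantization resolution (assumption, not from the paper)
EOS = "<eos>"     # hypothetical end-of-contour marker (assumption)


def encode_contour(points: List[Point], width: int, height: int) -> List[str]:
    """Serialize ordered corner points into a flat token sequence."""
    tokens: List[str] = []
    for x, y in points:
        # Normalize each coordinate by the image size, then quantize to a bin index.
        bx = min(NUM_BINS - 1, max(0, int(x / width * NUM_BINS)))
        by = min(NUM_BINS - 1, max(0, int(y / height * NUM_BINS)))
        tokens += [f"<x_{bx}>", f"<y_{by}>"]
    tokens.append(EOS)  # signals that the polygon is closed
    return tokens


def decode_contour(tokens: List[str], width: int, height: int) -> List[Point]:
    """Recover corner points (bin centers) from a token sequence."""
    points: List[Point] = []
    it = iter(tokens)
    for tx in it:
        if tx == EOS:
            break
        ty = next(it, EOS)
        if ty == EOS:
            break  # dangling x token without a y token: stop decoding
        bx = int(tx[3:-1])  # strip "<x_" and ">"
        by = int(ty[3:-1])  # strip "<y_" and ">"
        points.append(((bx + 0.5) / NUM_BINS * width,
                       (by + 0.5) / NUM_BINS * height))
    return points


if __name__ == "__main__":
    roof = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0), (0.0, 50.0)]
    seq = encode_contour(roof, width=200, height=100)
    print(seq)
    print(decode_contour(seq, width=200, height=100))
```

In this sketch a model would emit one corner at a time and stop at the end marker, much as an annotator clicks corners in order and then closes the polygon. The round trip loses at most one bin width of precision per coordinate.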
