VectorLLM: Human-like Extraction of Structured Building Contours via Multimodal LLMs
Tao Zhang, Shiqing Wei, Shihao Chen, Wenling Yu, Muying Luo, Shunping Ji

TL;DR
VectorLLM introduces a novel multimodal large language model that directly regresses building contours from remote sensing images, outperforming previous methods and demonstrating strong zero-shot generalization to various objects.
Contribution
The paper presents the first LLM-based approach for direct corner-point regression of building contours, integrating vision and language models for improved accuracy and generalization.
Findings
Outperforms previous SOTA by 5.6-13.6 AP across datasets
Achieves strong zero-shot performance on unseen objects
Establishes a new paradigm for vector extraction in remote sensing
Abstract
Automatically extracting vectorized building contours from remote sensing imagery is crucial for urban planning, population estimation, and disaster assessment. Current state-of-the-art methods rely on complex multi-stage pipelines involving pixel segmentation, vectorization, and polygon refinement, which limits their scalability and real-world applicability. Inspired by the remarkable reasoning capabilities of Large Language Models (LLMs), we introduce VectorLLM, the first Multi-modal Large Language Model (MLLM) designed for regular building contour extraction from remote sensing images. Unlike existing approaches, VectorLLM performs corner-point by corner-point regression of building contours directly, mimicking human annotators' labeling process. Our architecture consists of a vision foundation backbone, an MLP connector, and an LLM, enhanced with learnable position embeddings to…
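To make "corner-point by corner-point regression" concrete, here is a minimal sketch of how a building polygon could be serialized into a sequence an autoregressive model emits one corner at a time, and then decoded back to pixel coordinates. Everything in it is an assumption for illustration: the token names (`<poly>`, `<x_N>`, `<y_N>`), the bin count `NUM_BINS`, and both function names are hypothetical and are not taken from the paper, which may represent coordinates quite differently.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Hypothetical quantization resolution (assumption, not from the paper).
NUM_BINS = 1000


def _quantize(value: float, extent: int) -> int:
    """Map a pixel coordinate in [0, extent] to an integer bin in [0, NUM_BINS - 1]."""
    bin_index = int(value / extent * NUM_BINS)
    return min(NUM_BINS - 1, max(0, bin_index))


def polygon_to_tokens(polygon: List[Point], width: int, height: int) -> List[str]:
    """Serialize corners in annotation order as one x token and one y token each."""
    tokens = ["<poly>"]
    for x, y in polygon:
        tokens.append(f"<x_{_quantize(x, width)}>")
        tokens.append(f"<y_{_quantize(y, height)}>")
    tokens.append("</poly>")
    return tokens


def tokens_to_polygon(tokens: List[str], width: int, height: int) -> List[Point]:
    """Decode coordinate tokens back to pixel positions (bin centers)."""
    bins = [int(t[3:-1]) for t in tokens if t.startswith(("<x_", "<y_"))]
    xs, ys = bins[0::2], bins[1::2]
    return [
        ((qx + 0.5) / NUM_BINS * width, (qy + 0.5) / NUM_BINS * height)
        for qx, qy in zip(xs, ys)
    ]


# Usage: a rectangular footprint in a 512x512 image tile.
footprint = [(10.0, 20.0), (200.0, 20.0), (200.0, 180.0), (10.0, 180.0)]
sequence = polygon_to_tokens(footprint, 512, 512)
recovered = tokens_to_polygon(sequence, 512, 512)
```

In this sketch each corner costs two tokens, so the output length grows with the number of corners rather than with image resolution, and the round trip loses at most one bin width of precision per coordinate.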
