Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics
Peter A. Massih, Eric Cosatto

TL;DR
This paper introduces QVLM, a new architecture that preserves pixel-level spatial information for quantitative reasoning in satellite images, and SQuID, a dataset for evaluating such reasoning, demonstrating significant accuracy improvements over traditional VLMs.
Contribution
The paper presents a novel code-generation architecture, QVLM, and a new dataset, SQuID, to improve quantitative spatial reasoning in vision-language models.
Findings
QVLM achieves 42.0% accuracy on SQuID
Traditional VLMs achieve 28.1% accuracy on SQuID
Architectural decoupling improves quantitative reasoning performance
Abstract
Current Vision-Language Models (VLMs) fail at quantitative spatial reasoning because their architectures destroy pixel-level information required for counting and measurements. Vision encoders compress images through patch embeddings, reducing spatial indexing and losing the precise pixel-level tracking required for accurate counting. We present two contributions to address this fundamental limitation. First, we introduce SQuID (Satellite Quantitative Intelligence Dataset), a benchmark of 2,000 satellite image Question-Answer pairs with both numerical range and categorical answers, designed to evaluate quantitative spatial reasoning. The dataset spans three difficulty tiers with annotations automatically generated from human labels and their learned variability. Second, we propose QVLM (Quantitative Vision-Language Model), a code-generation architecture that maintains pixel precision by…
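The abstract's claim that patch-level compression loses the information needed for counting can be illustrated with a toy example. The sketch below is not the authors' QVLM pipeline and does not model a real vision encoder: max pooling over fixed patches is an assumed stand-in for learned patch embeddings, and the names `count_components`, `max_pool`, and `MASK` are hypothetical. It only shows that objects which are distinct at pixel resolution can merge once the image is reduced to one value per patch.

```python
from collections import deque

def count_components(grid):
    """Count 4-connected groups of nonzero cells in a 2D list."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                # Breadth-first flood fill over the current object.
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

def max_pool(grid, patch):
    """Reduce each patch x patch block to a single value (toy stand-in
    for patch embedding; assumes dimensions are divisible by patch)."""
    rows, cols = len(grid), len(grid[0])
    return [
        [max(grid[y][x]
             for y in range(r, r + patch)
             for x in range(c, c + patch))
         for c in range(0, cols, patch)]
        for r in range(0, rows, patch)
    ]

# Hypothetical 8x8 object mask: four single-pixel objects and one 2x2 object.
MASK = [
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

pixel_count = count_components(MASK)               # 5 objects at pixel level
patch_count = count_components(max_pool(MASK, 4))  # 2 objects after pooling
print(pixel_count, patch_count)
```

In this toy mask the four small objects in the top-left quadrant collapse into a single occupied patch, so the count drops from 5 to 2. Operating on the pixel-level mask keeps each object separately addressable, which is the property the abstract argues is needed for counting and measurement.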
Taxonomy
Topics: Constraint Satisfaction and Optimization · Geographic Information Systems Studies · Multimodal Machine Learning Applications
