Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics

Peter A. Massih; Eric Cosatto

arXiv:2601.13401 · cs.CV · January 21, 2026

TL;DR

This paper introduces QVLM, an architecture that preserves pixel-level spatial information for quantitative reasoning over satellite images, and SQuID, a dataset for evaluating such reasoning. On SQuID, QVLM reaches 42.0% accuracy, compared with 28.1% for traditional VLMs.

Contribution

The paper presents a novel code-generation architecture, QVLM, and a new dataset, SQuID, to improve quantitative spatial reasoning in vision-language models.

Findings

01 QVLM achieves 42.0% accuracy on SQuID

02 Traditional VLMs achieve 28.1% accuracy on SQuID

03 Architectural decoupling improves quantitative reasoning performance

Abstract

Current Vision-Language Models (VLMs) fail at quantitative spatial reasoning because their architectures destroy pixel-level information required for counting and measurements. Vision encoders compress images through patch embeddings, reducing spatial indexing and losing the precise pixel-level tracking required for accurate counting. We present two contributions to address this fundamental limitation. First, we introduce SQuID (Satellite Quantitative Intelligence Dataset), a benchmark of 2,000 satellite image Question-Answer pairs with both numerical range and categorical answers, designed to evaluate quantitative spatial reasoning. The dataset spans three difficulty tiers with annotations automatically generated from human labels and their learned variability. Second, we propose QVLM (Quantitative Vision-Language Model), a code-generation architecture that maintains pixel precision by…
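
The abstract's claim that patch-level compression loses the spatial detail needed for counting can be illustrated with a toy sketch. This is not the paper's method or code. It assumes a hand-made binary mask, and it uses max pooling over square blocks as a crude stand-in for patch embedding (real vision encoders use a learned projection). The helper names `count_components`, `pool_patches`, and `make_mask` are hypothetical.

```python
from collections import deque


def count_components(grid):
    """Count 4-connected groups of nonzero cells in a binary grid."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                # New object found: flood-fill it so it is counted once.
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and grid[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def pool_patches(grid, patch):
    """Collapse each patch x patch block to 1 if any pixel in it is set.

    Assumption: max pooling only mimics the loss of spatial resolution;
    it is not how a real patch embedding is computed.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [
            int(any(
                grid[y][x]
                for y in range(r, min(r + patch, rows))
                for x in range(c, min(c + patch, cols))
            ))
            for c in range(0, cols, patch)
        ]
        for r in range(0, rows, patch)
    ]


def make_mask(size, points):
    """Build a size x size binary mask with single-pixel objects at points."""
    mask = [[0] * size for _ in range(size)]
    for y, x in points:
        mask[y][x] = 1
    return mask


# Toy scene: five isolated single-pixel "objects" on an 8x8 grid.
mask = make_mask(8, [(0, 0), (0, 2), (2, 0), (2, 2), (5, 6)])

pixel_count = count_components(mask)                   # counts at full resolution
patch_count = count_components(pool_patches(mask, 4))  # counts after 4x4 pooling

print(pixel_count, patch_count)  # prints "5 2"
```

In this toy scene, four of the five objects fall inside one 4x4 block, so the pooled grid can no longer tell them apart. Counting on the original pixels recovers all five.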

Code & Models

Datasets

PeterAM4/SQuID
dataset · 114 dl

Taxonomy

Topics: Constraint Satisfaction and Optimization · Geographic Information Systems Studies · Multimodal Machine Learning Applications