Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

Junming Liu; Yuqi Li; Yifei Sun; Maonan Wang; Piotr Koniusz; Yirong Chen; Ding Wang

arXiv:2605.18162 · cs.CV · May 19, 2026

TL;DR

This paper introduces SAGE, a self-evolving framework that enhances spatial reasoning in vision-language models by enforcing geometric and linguistic logical consistency, leading to improved robustness and generalization.

Contribution

SAGE is a novel, model-agnostic, and data-efficient method that applies geometric logic consistency as a self-evolving auxiliary training process for VLMs.

Findings

01. SAGE improves spatial reasoning accuracy on benchmarks.

02. Models trained with SAGE generalize better to unseen data.

03. SAGE is lightweight and can be applied post-training.

Abstract

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic,…
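
The two mechanisms named in the abstract, a duality-consistency reward and a dynamic operation pool, can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here (`FLIP_MAP`, `consistency_reward`, `total_reward`, `OperationPool`), the reward weight, the retirement threshold, and the example operation names are assumptions chosen for illustration. The sketch only shows the general idea: an answer on a transformed input is rewarded when it matches the predictably mapped original answer, and operations the model handles consistently are dropped from the active set.

```python
from dataclasses import dataclass, field

# Hypothetical answer mapping for a horizontal flip: left/right swap,
# every other answer is assumed to stay unchanged.
FLIP_MAP = {"left": "right", "right": "left"}


def map_answer(answer: str, mapping: dict) -> str:
    """Return the answer expected after the transformation."""
    return mapping.get(answer, answer)


def consistency_reward(orig_answer: str, transformed_answer: str, mapping: dict) -> float:
    """1.0 if the transformed-input answer equals the mapped original answer, else 0.0."""
    return 1.0 if transformed_answer == map_answer(orig_answer, mapping) else 0.0


def total_reward(correct: bool, consistent: float, weight: float = 0.5) -> float:
    """Task reward plus a weighted auxiliary consistency term (the weight is assumed)."""
    return float(correct) + weight * consistent


@dataclass
class OperationPool:
    """Tracks per-operation inconsistency and retires operations that look mastered."""
    retire_below: float = 0.1  # assumed failure-rate threshold for retirement
    min_trials: int = 4        # assumed minimum evidence before retiring
    stats: dict = field(default_factory=dict)  # name -> [failures, trials]

    def record(self, name: str, consistent: bool) -> None:
        failures, trials = self.stats.get(name, [0, 0])
        self.stats[name] = [failures + (not consistent), trials + 1]

    def failure_rate(self, name: str) -> float:
        failures, trials = self.stats[name]
        return failures / trials if trials else 1.0

    def active(self) -> list:
        """Operations still worth training on, highest failure rate first."""
        keep = [name for name, (failures, trials) in self.stats.items()
                if trials < self.min_trials or failures / trials >= self.retire_below]
        return sorted(keep, key=self.failure_rate, reverse=True)


pool = OperationPool()
for _ in range(4):
    pool.record("hflip", True)            # always consistent: retired
for ok in [False, False, True, False]:
    pool.record("swap_objects", ok)       # mostly inconsistent: prioritized
pool.record("rotate", True)
pool.record("rotate", False)              # too few trials to retire
active_ops = pool.active()
```

In this toy run `hflip` is retired after four consistent trials, while `swap_objects` is ranked ahead of `rotate` because it fails more often. In a GRPO-style setup the value returned by `total_reward` would be the per-sample scalar fed to the policy update; how SAGE actually weights and schedules these signals is not specified in the text above.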
