TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?

Yikun Zong; Cheston Tan

arXiv:2602.05570·cs.AI·February 6, 2026

TL;DR

This paper introduces a test-time self-refinement framework inspired by human spatial reasoning, significantly improving vision-language models' ability to perform continuous geometric reasoning tasks like Tangram puzzles without retraining.

Contribution

We propose a training-free verifier-refiner framework that enhances geometric reasoning in vision-language models through iterative self-refinement using in-context learning and reward feedback.
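As a rough illustration of the verifier-refiner idea, the sketch below runs a generic propose/score loop at test time with no parameter updates. It is not the paper's implementation: the `Pose` type, the `refine` function, the stopping threshold, and the toy `propose` and `score` stand-ins are all hypothetical. In a real system `propose` would presumably be a VLM call that sees earlier attempts and their rewards in context, and `score` would be a verifier such as an IoU check against the target shape.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical pose of one piece: (x, y, rotation in degrees).
Pose = Tuple[float, float, float]
History = List[Tuple[Pose, float]]


def refine(
    propose: Callable[[History], Pose],
    score: Callable[[Pose], float],
    initial: Pose,
    max_rounds: int = 5,
    target: float = 0.9,
) -> Tuple[Pose, float, History]:
    """Iteratively propose a pose, score it, and keep the best attempt."""
    history: History = [(initial, score(initial))]
    best_pose, best_reward = history[0]
    for _ in range(max_rounds):
        if best_reward >= target:
            break  # good enough, stop refining
        candidate = propose(history)   # refiner sees all past attempts and rewards
        reward = score(candidate)      # verifier assigns a scalar reward
        history.append((candidate, reward))
        if reward > best_reward:
            best_pose, best_reward = candidate, reward
    return best_pose, best_reward, history


# Toy stand-ins (assumptions for the demo only).
GOAL: Pose = (3.0, 4.0, 0.0)


def toy_score(pose: Pose) -> float:
    # Reward in (0, 1]; equals 1 only when the pose matches GOAL exactly.
    return 1.0 / (1.0 + math.dist(pose, GOAL))


def toy_propose(history: History) -> Pose:
    # Stand-in for a model call: move the best attempt halfway toward GOAL.
    # A real refiner would not know GOAL; it would only see the feedback.
    best, _ = max(history, key=lambda item: item[1])
    return tuple((b + g) / 2.0 for b, g in zip(best, GOAL))


if __name__ == "__main__":
    pose, reward, history = refine(toy_propose, toy_score, (0.0, 0.0, 0.0))
    for step, (p, r) in enumerate(history):
        print(step, p, round(r, 3))
```

Keeping the full history, rather than only the latest attempt, is what lets the proposer use earlier attempts as in-context examples. The loop keeps the best-scoring attempt, so a poor later proposal cannot make the returned result worse.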

Findings

01. IoU improved from 0.63 to 0.932 on medium-triangle cases

02. Systematic failures in existing VLMs on continuous geometric tasks

03. Iterative refinement significantly boosts spatial reasoning performance

Abstract

Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial-and-error, observation, and correction, we design a framework that models these human cognitive mechanisms. However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: average IoU of only 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition, far below human performance where children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context…
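For readers unfamiliar with the metric, IoU (intersection over union) measures how much a predicted piece placement overlaps the target: the area shared by both shapes divided by the area covered by either. The sketch below approximates it by rasterizing polygons onto a grid. The function names, grid size, and extent are assumptions made for illustration; the excerpt above does not describe the paper's exact evaluation procedure.

```python
from typing import List, Set, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def point_in_polygon(x: float, y: float, poly: List[Point]) -> bool:
    """Even-odd ray casting: count edge crossings to the right of (x, y)."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(poly: List[Point], size: int = 64, extent: float = 4.0) -> Set[Cell]:
    """Return the grid cells whose centers fall inside the polygon."""
    step = extent / size
    return {
        (i, j)
        for i in range(size)
        for j in range(size)
        if point_in_polygon((i + 0.5) * step, (j + 0.5) * step, poly)
    }


def iou(a: Set[Cell], b: Set[Cell]) -> float:
    """Intersection over union of two rasterized shapes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    target = [(0, 0), (2, 0), (2, 2), (0, 2)]      # 2x2 square
    predicted = [(1, 0), (3, 0), (3, 2), (1, 2)]   # same square shifted right by 1
    print(iou(rasterize(target), rasterize(predicted)))
```

A finer grid gives a closer approximation of the true polygon areas at the cost of more point tests; an exact polygon-clipping routine would avoid the approximation entirely.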

Taxonomy

Topics: Spatial Cognition and Navigation · Multimodal Machine Learning Applications · Robot Manipulation and Learning