TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?
Yikun Zong, Cheston Tan

TL;DR
This paper introduces a test-time self-refinement framework inspired by human spatial reasoning, significantly improving vision-language models' ability to perform continuous geometric reasoning tasks like Tangram puzzles without retraining.
Contribution
We propose a training-free verifier-refiner framework that enhances geometric reasoning in vision-language models through iterative self-refinement using in-context learning and reward feedback.
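To make the verifier-refiner idea concrete, here is a minimal sketch of a test-time refinement loop. It is not the authors' implementation: the names `refine`, `toy_propose`, and `toy_score` and the `(x, y, rotation)` pose format are all hypothetical, and the toy proposer stands in for a VLM that would read the history of earlier attempts and their rewards as in-context feedback. No parameters are updated; only the prompt history grows.

```python
from typing import Callable, List, Optional, Tuple

Pose = Tuple[float, float, float]   # hypothetical (x, y, rotation in degrees)
History = List[Tuple[Pose, float]]  # earlier attempts paired with their rewards


def refine(
    propose: Callable[[History], Pose],
    score: Callable[[Pose], float],
    max_iters: int = 10,
    target: float = 0.95,
) -> Tuple[Optional[Pose], float, History]:
    """Propose, verify, and retry until the reward reaches `target`."""
    history: History = []
    best_pose: Optional[Pose] = None
    best_reward = float("-inf")
    for _ in range(max_iters):
        pose = propose(history)          # refiner: conditions on past feedback
        reward = score(pose)             # verifier: scalar reward, e.g. IoU
        history.append((pose, reward))   # feedback for the next round
        if reward > best_reward:
            best_pose, best_reward = pose, reward
        if best_reward >= target:
            break
    return best_pose, best_reward, history


def toy_propose(history: History) -> Pose:
    """Stand-in for a VLM call: nudge the best pose so far one unit along x."""
    if not history:
        return (0.0, 0.0, 0.0)
    best = max(history, key=lambda item: item[1])[0]
    return (best[0] + 1.0, best[1], best[2])


def toy_score(pose: Pose) -> float:
    """Stand-in verifier: reward peaks at 1.0 when x equals 3."""
    return 1.0 / (1.0 + abs(pose[0] - 3.0))


best_pose, best_reward, history = refine(toy_propose, toy_score)
```

In a real setting, `propose` would wrap a model call whose prompt includes the earlier poses and rewards, and `score` would compare the rendered placement against the target silhouette.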
Findings
IoU improved from 0.63 to 0.932 on medium-triangle cases
Systematic failures in existing VLMs on continuous geometric tasks: average IoU of 0.41 on single-piece tasks and 0.23 on two-piece composition
Iterative refinement significantly boosts spatial reasoning performance
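The findings above are reported as IoU (intersection over union). As a minimal sketch, assuming the predicted and target shapes have already been rasterized into equally sized binary masks, IoU can be computed as below. The function name `mask_iou` and the nested-list mask format are illustrative assumptions; the paper's own evaluation code may differ.

```python
from typing import List

Mask = List[List[int]]  # hypothetical format: rows of 0/1 pixel values


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match by convention here.
    return intersection / union if union else 1.0


# A 2x2 square shifted one column to the right of the target:
# 2 overlapping pixels out of 6 covered pixels.
pred = [[0, 1, 1],
        [0, 1, 1]]
target = [[1, 1, 0],
          [1, 1, 0]]
iou = mask_iou(pred, target)
```

An IoU of 1.0 means the placed piece covers exactly the target region, while small offsets or rotation errors lower the score continuously, which is what makes it usable as a reward signal.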
Abstract
Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial-and-error, observation, and correction, we design a framework that models these human cognitive mechanisms. However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: average IoU is only 0.41 on single-piece tasks and drops to 0.23 on two-piece composition, far below human performance, where children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context…
Taxonomy
Topics: Spatial Cognition and Navigation · Multimodal Machine Learning Applications · Robot Manipulation and Learning
