Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning
Hao Yu, Shuning Jia, Guanghao Li, Wenhao Jiang, Chun Yuan

TL;DR
This paper introduces GeoPerceive, a benchmark for evaluating geometric perception in vision-language models, and proposes GeoDPO, a reinforcement learning framework that significantly improves geometric reasoning capabilities and generalization in VLMs.
Contribution
The paper presents GeoPerceive for isolated geometric perception evaluation and GeoDPO, a novel RL method using an NL-to-DSL translator to enhance VLMs' geometric reasoning.
Findings
GeoDPO improves in-domain performance by 26.5%.
GeoDPO improves out-of-domain accuracy by 8.0%.
GeoDPO improves downstream reasoning tasks by 39.0%.
Abstract
Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables the isolated evaluation of geometric perception independently from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator-guided reinforcement learning (RL) framework. GeoDPO employs an NL-to-DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine-grained, DSL-level scores, which serve as reward signals in…
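To make the idea of a DSL-level score concrete, here is a minimal sketch of how a translated description could be compared with a ground-truth DSL program and how the scores could then rank candidate descriptions into preference pairs. Everything in it is an assumption for illustration: the one-statement-per-line DSL syntax, the set-based F1 score, the `margin` threshold, and the `translate` callable standing in for the NL-to-DSL translator are hypothetical and not taken from the paper.

```python
from itertools import combinations


def parse_dsl(text):
    """Split a DSL program into a set of whitespace-normalized statements.

    Assumes one statement per line (hypothetical syntax).
    """
    return {" ".join(line.split()) for line in text.splitlines() if line.strip()}


def dsl_f1(predicted, reference):
    """Score a predicted DSL program against a reference with set-level F1.

    This is an illustrative metric, not the paper's scoring function.
    """
    pred, ref = parse_dsl(predicted), parse_dsl(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def build_preference_pairs(candidates, reference, translate, margin=0.1):
    """Turn scored candidates into (chosen, rejected) pairs.

    `translate` is a placeholder for an NL-to-DSL translator; a pair is kept
    only when the two scores differ by at least `margin`.
    """
    scored = [(c, dsl_f1(translate(c), reference)) for c in candidates]
    pairs = []
    for (a, score_a), (b, score_b) in combinations(scored, 2):
        if abs(score_a - score_b) >= margin:
            pairs.append((a, b) if score_a > score_b else (b, a))
    return pairs
```

In this sketch a description that recovers more of the reference statements scores higher, so it ends up on the "chosen" side of each pair, which is the form of supervision that preference-based training methods consume.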
Peer Reviews
Decision: ICLR 2026 Poster
1. Benchmark and Synthetic Data Pipeline: The paper introduces GEOPERCEIVE, which systematically generates and renders diagrams with unambiguous geometric DSLs. The pipeline creates diverse complexity-controlled data at scale, providing a solid resource for both evaluation and training. 2. Meaningful Visualizations: Figures such as Figure 1 (exposing ambiguity in existing DSLs), Figure 2 (illustrating GeoDSL syntax), Figure 3 (full pipeline including RL steps), and Figure 4 (showing data comple
1. Lack of Rigor in the Task: The geometric perception task is not merely a simple perceptual task. Specifically, the geometric information obtained through basic image perception should undergo rigorous verification (at least ensuring that the perceived constraints are consistent and non-conflicting). For example, certain relationships in the image, such as perpendicularity or numerical data, must be strictly validated against predefined conditions. 2. Insufficient Explanation of Geometric Rel
1. This paper provides a rigorous and clear exposition of its methodology and experiments; 2. The proposed GEOPERCEIVE pipeline provides an efficient, automated scheme for generating geometric perception data; 3. It emphasizes the distinction between perception and reasoning, converts ambiguous natural language into strict DSL through the NL-to-DSL translator, and provides a standard reward model for geometric perception results.
1. Lack of comparison with related methods in the experiments (e.g., [Slow Perception](https://arxiv.org/abs/2412.20631), [EAGLE](https://arxiv.org/abs/2408.11397)); 2. Some experimental results show no significant improvement over the SFT method.
1. The authors clearly identify geometric perception as a distinct subproblem within vision-language reasoning, separating low-level perception (recognition of geometric primitives and relations) from high-level symbolic reasoning (logical deduction). This represents a novel and well-motivated decomposition that has not been explicitly formalized in prior VLM research. 2. The paper conducts extensive benchmarking across multiple major vision-language model backbones (Qwen2.5-VL, InternVL3, LLaVA
1. In Section 3, the paper introduces the GeoPerceive Benchmark, but it does not provide sufficient details regarding its categorical composition, dataset scale, or comparative positioning relative to existing benchmarks. It is recommended that the authors include a summary table outlining the dataset’s structure, sample counts, and distinctions from related benchmarks to improve clarity and reproducibility. 2. The experimental comparisons primarily focus on SFT-based fine-tuning, without incorp
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Robot Manipulation and Learning
