Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning

Hao Yu; Shuning Jia; Guanghao Li; Wenhao Jiang; Chun Yuan

arXiv:2602.22703·cs.LG·February 27, 2026

Open Access 3 Reviews

TL;DR

This paper introduces GeoPerceive, a benchmark for evaluating geometric perception in vision-language models, and proposes GeoDPO, a reinforcement learning framework that significantly improves geometric reasoning capabilities and generalization in VLMs.

Contribution

The paper presents GeoPerceive for isolated geometric perception evaluation and GeoDPO, a novel RL method using an NL-to-DSL translator to enhance VLMs' geometric reasoning.

Findings

1. GeoDPO improves in-domain performance by 26.5%.
2. GeoDPO improves out-of-domain accuracy by 8.0%.
3. GeoDPO improves performance on downstream reasoning tasks by 39.0%.

Abstract

Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables the isolated evaluation of geometric perception independently from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator-guided reinforcement learning (RL) framework. GeoDPO employs an NL-to-DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine-grained, DSL-level scores, which serve as reward signals in…
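To make the idea of DSL-level scores as reward signals concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical one-statement-per-line DSL of the form `predicate(arg, ...)`, uses statement-level F1 as the score, and turns the best and worst scored candidates into a preference pair of the kind a DPO-style trainer consumes. The function names, the DSL syntax, the F1 choice, and the `translate` callable (a stand-in for the NL-to-DSL translator) are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical DSL: one statement per line, e.g. "perpendicular(AB, CD)".
_STMT = re.compile(r"^\s*(\w+)\s*\(([^()]*)\)\s*$")

Statement = Tuple[str, Tuple[str, ...]]


def parse_dsl(text: str) -> Set[Statement]:
    """Parse DSL text into a set of (predicate, args) statements."""
    statements: Set[Statement] = set()
    for line in text.splitlines():
        match = _STMT.match(line)
        if match is None:
            continue  # skip lines that are not well-formed statements
        name = match.group(1).lower()
        args = tuple(a.strip() for a in match.group(2).split(",") if a.strip())
        statements.add((name, args))
    return statements


def dsl_f1(predicted: str, gold: str) -> float:
    """Statement-level F1 between predicted and ground-truth DSL (assumed metric)."""
    pred, ref = parse_dsl(predicted), parse_dsl(gold)
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


def build_preference_pair(
    candidates: List[str],
    gold: str,
    translate: Callable[[str], str],
    min_gap: float = 0.0,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) descriptions, or None if scores do not differ enough.

    `translate` stands in for the NL-to-DSL translator: it maps a natural
    language description produced by the VLM to DSL text.
    """
    if len(candidates) < 2:
        return None
    scored = sorted(
        ((dsl_f1(translate(c), gold), c) for c in candidates),
        key=lambda pair: pair[0],
    )
    (low_score, rejected), (high_score, chosen) = scored[0], scored[-1]
    if high_score - low_score <= min_gap:
        return None
    return chosen, rejected
```

In this sketch, several descriptions sampled from the VLM for one diagram would be passed as `candidates`, and the returned pair would feed a preference-optimization step. Swapping `dsl_f1` for a different DSL-level metric changes only the scoring function.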

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 5

Strengths

1. Benchmark and Synthetic Data Pipeline: The paper introduces GEOPERCEIVE, which systematically generates and renders diagrams with unambiguous geometric DSLs. The pipeline creates diverse complexity-controlled data at scale, providing a solid resource for both evaluation and training.
2. Meaningful Visualizations: Figures such as Figure 1 (exposing ambiguity in existing DSLs), Figure 2 (illustrating GeoDSL syntax), Figure 3 (full pipeline including RL steps), and Figure 4 (showing data comple

Weaknesses

1. Lack of Rigor in the Task: Geometric perception is not a purely perceptual task. The geometric information extracted from an image should be rigorously verified, at minimum by checking that the perceived constraints are consistent and non-conflicting. For example, relationships in the image such as perpendicularity, as well as numerical data, must be strictly validated against predefined conditions.
2. Insufficient Explanation of Geometric Rel

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper gives a rigorous and clear exposition of its methodology and experiments.
2. The proposed GEOPERCEIVE pipeline provides an efficient, automated scheme for generating geometric perception data.
3. The paper emphasizes the distinction between perception and reasoning, converts ambiguous natural language into a strict DSL through the NL-to-DSL translator, and provides a standard reward model for geometric perception results.

Weaknesses

1. The experiments lack comparisons with related methods (e.g. [Slow Perception](https://arxiv.org/abs/2412.20631), [EAGLE](https://arxiv.org/abs/2408.11397)).
2. Some experimental results show no significant improvement over the SFT baseline.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The authors clearly identify geometric perception as a distinct subproblem within vision-language reasoning, separating low-level perception (recognition of geometric primitives and relations) from high-level symbolic reasoning (logical deduction). This represents a novel and well-motivated decomposition that has not been explicitly formalized in prior VLM research.
2. The paper conducts extensive benchmarking across multiple major vision-language model backbones (Qwen2.5-VL, InternVL3, LLaVA

Weaknesses

1. In Section 3, the paper introduces the GeoPerceive Benchmark, but it does not provide sufficient details regarding its categorical composition, dataset scale, or comparative positioning relative to existing benchmarks. It is recommended that the authors include a summary table outlining the dataset’s structure, sample counts, and distinctions from related benchmarks to improve clarity and reproducibility.
2. The experimental comparisons primarily focus on SFT-based fine-tuning, without incorp

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Robot Manipulation and Learning