G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

Jiahui Gao; Renjie Pi; Jipeng Zhang; Jiacheng Ye; Wanjun Zhong; Yufei Wang; Lanqing Hong; Jianhua Han; Hang Xu; Zhenguo Li; Lingpeng Kong

arXiv:2312.11370·cs.CL·August 21, 2025·5 cites

TL;DR

This paper introduces G-LLaVA, a multimodal large language model designed to solve geometric problems by understanding images, significantly outperforming existing models like GPT-4-V on geometric reasoning benchmarks.

Contribution

The paper presents a new multimodal dataset Geo170K and a model G-LLaVA that effectively integrates geometric image understanding with language reasoning, advancing geometric problem solving capabilities.

Findings

01

G-LLaVA outperforms GPT-4-V on the MathVista benchmark.

02

Constructed Geo170K dataset with 170K geometric image-question pairs.

03

G-LLaVA achieves high accuracy in geometric reasoning tasks.

Abstract

Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation capabilities, which encourages extensive research on their application in mathematical problem solving. However, current work has been largely focused on text-based mathematical problems, with limited investigation in problems involving geometric information. Addressing this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as unique geometric logical form, and geometric scalability) and the capacity of the textual LLMs to build an enriched multimodal geometry…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- This paper curates a large-scale geometry dataset (28 times larger than the biggest existing geometry dataset) containing both detailed description of geometric images and reasoning paths, which may help MLLMs improve their ability in geometric comprehension.
- Data augmentation techniques like value scaling, equation solving, Re-Formulating conditions as unknown, and sentence paraphrase make it convenient to enlarge the size of dataset and enhance robustness.
- Extensive experiments validate

Weaknesses

- The experiments are all about choice questions. Some works also involve proving[1] and completion[2] questions, which can be more difficult. Experiments on such questions are neglected.
- In many experiments, the strongest existing model GPT4-V is not included, making the result less convincing.
- The proposed G-LLaVA selects LLAMA-2 as LLM and a pretrained ViT as the vision encoder. A comparative analysis or experiment on other architecture or pretrained model may help justify this choice. [

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. Understanding geometric problems is important for MLLMs.
2. Presents a dataset with geometric alignment and instruction data.

Weaknesses

1. The novelty is limited, as it primarily fine-tunes an existing MLLM.
2. The models compared in Table 14 need to be updated; stronger baselines, such as QwenVL2 and InternVL2, should be considered.
3. Since the authors claim that G-LLaVA enhances geometric understanding from image input, it would be interesting to see G-LLaVA's performance improvement when only input $Q$ is provided, in Tables 8, 9, and 14.

Reviewer 03 · Rating 5 · Confidence 5

Strengths

1. This work explores an interesting topic, geometric problem solving, and currently, most MLLMs perform poorly on this task.
2. The expression and organization of this manuscript make it easy for readers to understand.
3. The dataset proposed in this manuscript is relatively large and includes image-caption and question-answer pairs data.

Weaknesses

1. Novelty and Technical Contribution: The proposed geometric dataset construction approach relies on the existing logical annotation data and uses ChatGPT for automatic labeling. Besides, the overall model structure and training strategy mainly apply the existing techniques, lacking sufficient innovation. Please explain the differences in model structure or training strategy during the rebuttal phase and whether the authors made customized designs on a geometric domain. No specific consideratio

Taxonomy

Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Text Readability and Simplification