GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation

Shihao Cai; Keqin Bao; Hangyu Guo; Jizhi Zhang; Jun Song; Bo Zheng

arXiv:2406.11503·cs.CV·June 18, 2024

TL;DR

This paper introduces GeoGPT4V, a dataset and data generation pipeline for training multi-modal large language models on geometry problems, aimed at improving how well these models use image information.

Contribution

The paper presents a data generation pipeline that uses GPT-4 and GPT-4V to produce relatively basic geometry problems with aligned text and images. The 4.9K generated problems are combined with 19K open-source examples to form the GeoGPT4V dataset.

Findings

1. GeoGPT4V dataset improves geometry understanding on benchmarks
2. Generated data enhances model performance significantly
3. Pipeline enables effective multi-modal learning in geometry

Abstract

Large language models have seen widespread adoption in math problem-solving. However, in geometry problems that usually require visual aids for better understanding, even the most advanced multi-modal models currently still face challenges in effectively using image information. High-quality data is crucial for enhancing the geometric capabilities of multi-modal models, yet existing open-source datasets and related efforts are either too challenging for direct model learning or suffer from misalignment between text and images. To overcome this issue, we introduce a novel pipeline that leverages GPT-4 and GPT-4V to generate relatively basic geometry problems with aligned text and images, facilitating model learning. We have produced a dataset of 4.9K geometry problems and combined it with 19K open-source data to form our GeoGPT4V dataset. Experimental results demonstrate that the…

Code & Models

Repositories

lanyu0303/geogpt4v_project (official)

Videos

GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation · Underline

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing

Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer