TL;DR
GeoTikzBridge is a framework that improves geometric perception and visual reasoning in multimodal large language models through tikz-based code generation. It comprises two models trained on two complementary datasets, including a 2.5M-pair image-to-tikz dataset, and reports state-of-the-art results among open-sourced MLLMs.
Contribution
The paper presents the first instruction-augmented tikz dataset and models that enhance geometric understanding and reasoning in multimodal large language models.
Findings
Models achieve state-of-the-art performance among open-sourced MLLMs.
GeoTikzBridge models serve as plug-and-play modules for geometric reasoning.
GeoTikz-Base is the largest image-to-tikz dataset to date, with 2.5M pairs, supporting extensive geometric perception.
Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. However, they struggle to perceive fine-grained geometric structures, which constrains their geometric understanding and visual reasoning. To address this, we propose GeoTikzBridge, a framework that enhances local geometric perception and visual reasoning through tikz-based code generation. Within this framework, we build two models supported by two complementary datasets. The GeoTikzBridge-Base model is trained on the GeoTikz-Base dataset, the largest image-to-tikz dataset to date with 2.5M pairs (16× larger than existing open-sourced datasets). This process is achieved via iterative data expansion and a localized geometric transformation strategy. Subsequently, GeoTikzBridge-Instruct is fine-tuned on the GeoTikz-Instruct dataset, which is the first…
