Tangram: Benchmark for Evaluating Geometric Element Recognition in Large Multimodal Models

Chao Zhang; Jiamin Tang; Jing Xiao

arXiv:2408.13854 · cs.CV · December 18, 2024

TL;DR

Tangram introduces a benchmark for assessing large multimodal models' ability to recognize geometric elements in diagrams, revealing significant performance gaps and highlighting the need for improved perception capabilities.

Contribution

We created Tangram, a comprehensive benchmark with diverse geometric diagrams to evaluate LMMs' geometric element recognition, an area underexplored in current research.

Findings

1. Top model accuracy is only 53% on geometric recognition tasks.

2. Current LMMs struggle with basic geometric perception.

3. The benchmark exposes significant gaps in multimodal models' understanding.

Abstract

Significant advancements in Large Multimodal Models (LMMs) have enabled them to tackle complex problems involving visual-mathematical reasoning. However, their ability to identify geometric elements remains underexplored. To address this gap, we introduce Tangram, a novel benchmark designed to evaluate the performance of LMMs on geometric element recognition. Tangram comprises 1,080 diverse geometric diagrams sourced from primary and secondary school exams, competitions, and textbooks, ranging from simple geometric shapes to complex combinations. Each diagram is paired with four questions, resulting in 4,320 visual-question-answer pairs. Unlike existing benchmarks that emphasize higher-level cognition and reasoning, Tangram focuses on understanding geometric elements, requiring models to perform a "simple yet challenging" counting task. Systematic evaluation of 13 prominent LMMs, such…
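As a rough illustration of the figures above, the following Python sketch checks the dataset-size arithmetic (1,080 diagrams × 4 questions) and scores a counting task by exact match. The record layout, the example questions, and the function names are assumptions made for illustration only; they are not the benchmark's released format or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence

# Figures stated in the abstract.
NUM_DIAGRAMS = 1080
QUESTIONS_PER_DIAGRAM = 4


@dataclass
class CountingItem:
    """One visual-question-answer pair (hypothetical layout)."""
    diagram_id: str
    question: str
    gold_count: int


def total_pairs(num_diagrams: int, questions_per_diagram: int) -> int:
    """Number of visual-question-answer pairs in the benchmark."""
    return num_diagrams * questions_per_diagram


def exact_match_accuracy(items: Sequence[CountingItem],
                         predictions: Sequence[int]) -> float:
    """Fraction of items whose predicted count equals the gold count."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(1 for item, pred in zip(items, predictions)
                  if pred == item.gold_count)
    return correct / len(items)


# Hypothetical example: two questions about one diagram, one answered correctly.
example_items = [
    CountingItem("diagram_0001", "How many triangles are in the diagram?", 3),
    CountingItem("diagram_0001", "How many circles are in the diagram?", 1),
]
example_predictions = [3, 2]

pairs = total_pairs(NUM_DIAGRAMS, QUESTIONS_PER_DIAGRAM)
accuracy = exact_match_accuracy(example_items, example_predictions)
```

Exact match is used here because a count is either right or wrong; whether the paper applies any tolerance or answer normalization is not stated in this excerpt.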

Taxonomy

Topics: Image Processing and 3D Reconstruction · Image and Object Detection Techniques

Methods: Fast Attention Via Positive Orthogonal Random Features · Performer