Tangram: Benchmark for Evaluating Geometric Element Recognition in Large Multimodal Models
Chao Zhang, Jiamin Tang, Jing Xiao

TL;DR
Tangram introduces a benchmark for assessing large multimodal models' ability to recognize geometric elements in diagrams, revealing significant performance gaps and highlighting the need for improved perception capabilities.
Contribution
We created Tangram, a comprehensive benchmark with diverse geometric diagrams to evaluate LMMs' geometric element recognition, an area underexplored in current research.
Findings
The top-performing model reaches only 53% accuracy on geometric element recognition.
Current LMMs struggle with basic geometric perception.
The benchmark exposes significant gaps in multimodal models' understanding.
Abstract
Significant advancements in Large Multimodal Models (LMMs) have enabled them to tackle complex problems involving visual-mathematical reasoning. However, their ability to identify geometric elements remains underexplored. To address this gap, we introduce Tangram, a novel benchmark designed to evaluate the performance of LMMs on geometric element recognition. Tangram comprises 1,080 diverse geometric diagrams sourced from primary and secondary school exams, competitions, and textbooks, ranging from simple geometric shapes to complex combinations. Each diagram is paired with four questions, resulting in 4,320 visual-question-answer pairs. Unlike existing benchmarks that emphasize higher-level cognition and reasoning, Tangram focuses on understanding geometric elements, requiring models to perform a "simple yet challenging" counting task. Systematic evaluation of 13 prominent LMMs, such…
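The dataset size stated in the abstract, and one plausible way to score a counting task, can be sketched as below. This is only an illustration: the abstract does not describe the scoring protocol, so the exact-match rule, the "last integer in the response" heuristic, and the names `extract_count` and `counting_accuracy` are all assumptions rather than the authors' method.

```python
import re

# Figures taken from the abstract.
NUM_DIAGRAMS = 1080
QUESTIONS_PER_DIAGRAM = 4
TOTAL_PAIRS = NUM_DIAGRAMS * QUESTIONS_PER_DIAGRAM  # 4,320 question-answer pairs


def extract_count(response):
    """Return the last integer found in a free-text response, or None.

    Assumption: the model states its final count last. The benchmark's
    actual answer-parsing rule is not given in the abstract.
    """
    matches = re.findall(r"-?\d+", response)
    return int(matches[-1]) if matches else None


def counting_accuracy(responses, gold_counts):
    """Fraction of responses whose extracted count equals the gold count.

    Assumption: exact-match scoring. Responses with no integer count as wrong.
    """
    if len(responses) != len(gold_counts):
        raise ValueError("responses and gold_counts must have the same length")
    if not gold_counts:
        return 0.0
    correct = sum(
        extract_count(resp) == gold for resp, gold in zip(responses, gold_counts)
    )
    return correct / len(gold_counts)


if __name__ == "__main__":
    # Hypothetical responses, for illustration only (not benchmark data).
    demo_responses = ["3", "I count 5 circles", "no idea", "7"]
    demo_gold = [3, 5, 2, 8]
    print(TOTAL_PAIRS)
    print(counting_accuracy(demo_responses, demo_gold))
```

Under exact-match scoring a response earns no partial credit for being off by one, which is one reason a counting task can be hard to score well on even when it looks simple.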
Taxonomy
Topics: Image Processing and 3D Reconstruction · Image and Object Detection Techniques
Methods: Fast Attention Via Positive Orthogonal Random Features · Performer
