InfoDet: A Dataset for Infographic Element Detection
Jiangning Zhu, Yuxing Zhou, Zheng Wang, Juntao Yao, Yima Gu, Yuhui Yuan, Shixia Liu

TL;DR
This paper introduces InfoDet, a large dataset of infographics with extensive annotations, to improve visual grounding and object detection in charts and infographic elements for vision-language models.
Contribution
The creation of InfoDet, a comprehensive dataset with over 14 million annotations for infographic elements, supporting advancements in chart understanding and object detection.
Findings
InfoDet enhances chart understanding in vision-language models.
The dataset improves object detection accuracy for infographic elements.
Application of models to document layout and UI detection demonstrates versatility.
Abstract
Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce InfoDet, a dataset designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 11,264 real and 90,000 synthetic infographics, with over 14 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of InfoDet through…
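As a quick sanity check on the scale figures quoted in the abstract, the following minimal sketch derives the total image count, the synthetic share, and a lower bound on boxes per image. The function name `composition` and the dictionary keys are illustrative only, and "over 14 million" is treated as a lower bound of exactly 14,000,000, which is an assumption rather than the dataset's actual count.

```python
# Counts quoted in the abstract above.
REAL_INFOGRAPHICS = 11_264
SYNTHETIC_INFOGRAPHICS = 90_000
# "over 14 million" annotations: assumed here as a lower bound, not the exact count.
MIN_BOXES = 14_000_000


def composition(real: int, synthetic: int, boxes: int) -> dict:
    """Derive simple composition statistics from the quoted counts."""
    total = real + synthetic
    return {
        "total_images": total,
        "synthetic_share": synthetic / total,
        "min_boxes_per_image": boxes / total,
    }


stats = composition(REAL_INFOGRAPHICS, SYNTHETIC_INFOGRAPHICS, MIN_BOXES)
print(f"total images:        {stats['total_images']:,}")
print(f"synthetic share:     {stats['synthetic_share']:.1%}")
print(f"boxes per image (>=): {stats['min_boxes_per_image']:.0f}")
```

Under these assumptions the total is 101,264 infographics, of which roughly 88.9% are synthetic, with at least about 138 bounding boxes per image on average.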
Peer Reviews
Decision·ICLR 2026 Poster
The paper’s methodological rigor and practical significance are outstanding. The hybrid annotation strategy—combining synthetic and real data under a model-in-the-loop refinement—balances scalability and accuracy, achieving dataset quality comparable to COCO. The scale (over 100k infographics) and fine-grained annotations (charts, HROs, sub-elements) fill a clear research gap. The grounded CoT prompting demonstrates genuine insight into how structured visual grounding improves reasoning in moder
First, the paper lacks quantitative validation for synthetic data fidelity and bias, which is a critical limitation given that nearly 90% of InfoDet consists of synthetic infographics. Without quantitative analysis or cross-domain alignment metrics—such as style distribution, semantic bias, or embedding similarity—the fidelity of synthetic data relative to real samples remains uncertain, casting doubt on downstream generalization and fairness. Second, there is insufficient analysis of annotation
1. The paper targets a real and under-served pain point in multimodal/chart understanding: current VLMs and chart/infographic QA systems often fail not because they cannot “reason,” but because they cannot reliably ground the relevant regions in cluttered infographic-style inputs.
2. The proposed InfoDet dataset is both large (≈101K images, ≈14M boxes) and unusually fine-grained, covering text, charts, and human-recognizable objects (HROs), as well as 26 chart-level marks and 75 chart types. This
1. The core novelty lies in building a large, high-quality dataset and a reasonable model-in-the-loop pipeline, plus a demonstrative prompting scheme. Compared to typical ICLR work, the methodological/learning novelty is modest.
2. Given that the dataset provides structured and layered infographic elements, the paper could reasonably be expected to propose a model or training scheme that explicitly exploits this structure (for example, through element-level selection, layout-aware fusion, or hie
**1. Comprehensive and Well-Designed Dataset**
- First large-scale infographic dataset (101,264 samples vs. prior Borkin et al. 393 samples) strategically combining real and synthetic data for authenticity and scalability
- Efficient model-in-the-loop annotation achieving quality comparable to COCO (precision 93.9%, recall 96.7% vs. COCO's 71.9%/83.0%)
- Multi-level annotations: element-level (charts, HROs) and mark-level (26 sub-element categories) providing fine-grained labels
- Verified diver
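The precision and recall figures quoted in the review above describe how well a set of bounding-box annotations agrees with a reference set. The excerpt does not say how the authors matched boxes, so the following is only a minimal sketch of one common approach: greedy one-to-one matching at an IoU threshold. The threshold of 0.5, the `(x_min, y_min, x_max, y_max)` box format, and the function names are all assumptions made for illustration, not the paper's protocol.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    annotated: List[Box], reference: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each annotated box to its best unmatched reference box.

    A match counts as a true positive when IoU >= threshold (0.5 is an
    assumed value, not one taken from the paper).
    """
    unmatched = list(range(len(reference)))
    true_positives = 0
    for box in annotated:
        best_index, best_iou = None, threshold
        for j in unmatched:
            overlap = iou(box, reference[j])
            if overlap >= best_iou:
                best_index, best_iou = j, overlap
        if best_index is not None:
            unmatched.remove(best_index)
            true_positives += 1
    precision = true_positives / len(annotated) if annotated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy example: two of three annotated boxes overlap a reference box.
annotated_boxes = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
reference_boxes = [(0, 0, 10, 10), (21, 21, 31, 31)]
precision, recall = precision_recall(annotated_boxes, reference_boxes)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In the toy example the third annotated box has no counterpart, so precision drops to two thirds while recall stays at 1.0. Greedy matching is order-dependent; an optimal assignment (for example, Hungarian matching) is a reasonable alternative when boxes are densely packed.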
**1. Dataset Construction Issues: Representativeness and Transparency**
- **Fine-grained annotation imbalance**: 75 chart types exist only for synthetic infographics. Authors mention GPT-4o achieved only 61.49% accuracy on real infographics but provide no alternative approach (human annotation? better models?), leaving the dataset incomplete.
- **Annotation process opaque**: No information on expert demographics (number? background: medical imaging experts? graphic designers? CV researchers?), train
