Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents
Yiyang Lu, Woong Shin, Ahmad Maroof Karimi, Feiyi Wang, Jie Ren, Evgenia Smirni

TL;DR
This paper introduces IVG, a framework combining spec-based introspection and view-based interaction to improve visualization understanding and question answering accuracy in chart analysis.
Contribution
It proposes a framework that integrates structured specification querying with view manipulation, grounding chart understanding in more than pixel interpretation alone.
Findings
Introspection improves data reconstruction fidelity.
Interaction combined with introspection achieves 81% QA accuracy.
Gains of +6.7% on questions involving overlapping geometries.
Abstract
Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while the combination with…
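The two grounding modes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a figure specification shaped like Plotly's figure JSON (a `data` list of traces plus a `layout` object), and the helper names `introspect_value`, `isolate_trace`, and `zoom_x`, as well as the sample data, are hypothetical. Introspection reads exact values straight from the specification; interaction produces a modified view (hiding other traces, narrowing the x-axis range) that could then be re-rendered for the model.

```python
import copy

# Hypothetical figure spec in the shape of Plotly's figure JSON (data + layout).
# The traces and values are made up for illustration.
fig_spec = {
    "data": [
        {"type": "scatter", "name": "A", "x": [1, 2, 3], "y": [10, 20, 30]},
        {"type": "scatter", "name": "B", "x": [1, 2, 3], "y": [10, 21, 29]},
    ],
    "layout": {"xaxis": {"title": {"text": "step"}}},
}


def introspect_value(spec, trace_name, x_value):
    """Spec-grounded introspection: return the exact y stored for a trace at x.

    Returns None if the trace or the x value is not present in the spec.
    """
    for trace in spec["data"]:
        if trace.get("name") == trace_name:
            for x, y in zip(trace["x"], trace["y"]):
                if x == x_value:
                    return y
    return None


def isolate_trace(spec, trace_name):
    """View-grounded interaction: copy the spec and hide every other trace.

    Uses the trace-level `visible` attribute; "legendonly" keeps the trace
    in the legend but does not draw it.
    """
    new_spec = copy.deepcopy(spec)
    for trace in new_spec["data"]:
        is_target = trace.get("name") == trace_name
        trace["visible"] = True if is_target else "legendonly"
    return new_spec


def zoom_x(spec, lo, hi):
    """View-grounded interaction: copy the spec and restrict the x-axis range."""
    new_spec = copy.deepcopy(spec)
    new_spec["layout"].setdefault("xaxis", {})["range"] = [lo, hi]
    return new_spec


# Two nearly overlapping traces: the spec still gives the exact value.
exact = introspect_value(fig_spec, "B", 2)
# A modified view that draws only trace "A", zoomed to x in [1, 2].
view = zoom_x(isolate_trace(fig_spec, "A"), 1, 2)
```

Both interaction helpers return a modified copy rather than mutating the input, so the original specification stays available as ground truth for later introspection.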
