Image2Struct: Benchmarking Structure Extraction for Vision-Language Models
Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, Percy Liang

TL;DR
Image2Struct is a comprehensive benchmark that automatically evaluates vision-language models on their ability to extract and reconstruct structural information from images across multiple domains, using a round-trip similarity measure.
Contribution
We introduce Image2Struct, a fully automatic, multi-domain benchmark for evaluating structure extraction in vision-language models with a novel round-trip image comparison approach.
Findings
Scores vary widely across models.
Scores also vary significantly across domains, which suggests that the tasks differ in difficulty.
Together, these spreads show that the benchmark differentiates model capabilities.
Abstract
We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical…
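The round-trip evaluation described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not part of Image2Struct: `render_toy` is a hypothetical renderer standing in for a real LaTeX or HTML renderer, `pixel_similarity` is a stand-in metric (the abstract does not specify which image similarity measure the benchmark uses), and scoring a failed render as 0.0 is a choice made for this sketch.

```python
from typing import Callable, List

# Toy image type: rows of grayscale pixels in the range 0..255.
Image = List[List[int]]


def render_toy(structure: str) -> Image:
    """Hypothetical renderer: '#' becomes a black pixel, anything else white."""
    return [[0 if ch == "#" else 255 for ch in line]
            for line in structure.splitlines()]


def pixel_similarity(a: Image, b: Image) -> float:
    """Stand-in metric: 1 minus the mean absolute pixel difference.

    Returns 0.0 when the two images do not have the same shape.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 0.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    diff = sum(abs(pa - pb)
               for ra, rb in zip(a, b)
               for pa, pb in zip(ra, rb))
    return 1.0 - diff / (255 * total)


def round_trip_score(
    input_image: Image,
    predicted_structure: str,
    render: Callable[[str], Image],
    similarity: Callable[[Image, Image], float],
) -> float:
    """Render the predicted structure and compare it with the input image."""
    try:
        output_image = render(predicted_structure)
    except Exception:
        # Assumption for this sketch: unrenderable output scores zero.
        return 0.0
    return similarity(input_image, output_image)


if __name__ == "__main__":
    input_image = render_toy("##\n..")
    # Two different structures that render to the same image both score 1.0.
    print(round_trip_score(input_image, "##\n..", render_toy, pixel_similarity))
    print(round_trip_score(input_image, "##\n  ", render_toy, pixel_similarity))
    # One wrong pixel out of four lowers the score to 0.75.
    print(round_trip_score(input_image, "##\n.#", render_toy, pixel_similarity))
```

Because the comparison happens in image space, the first two predictions receive the same score even though their text differs, which is why a round-trip measure can handle tasks with multiple valid structures.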
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
