Image2Struct: Benchmarking Structure Extraction for Vision-Language Models

Josselin Somerville Roberts; Tony Lee; Chi Heem Wong; Michihiro Yasunaga; Yifan Mai; Percy Liang

arXiv:2410.22456 · cs.CV · October 31, 2024

TL;DR

Image2Struct is a comprehensive benchmark that automatically evaluates vision-language models on their ability to extract and reconstruct structural information from images across multiple domains, using a round-trip similarity measure.

Contribution

We introduce Image2Struct, a fully automatic, multi-domain benchmark for evaluating structure extraction in vision-language models with a novel round-trip image comparison approach.

Findings

01. Scores vary widely across models, indicating performance differences.

02. Performance varies significantly across domains, showing varying task difficulty.

03. The benchmark effectively differentiates model capabilities.

Abstract

We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical…
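The round-trip evaluation described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the names `round_trip_score`, `pixel_similarity`, `toy_model`, and `toy_render` are hypothetical, images are modeled as nested lists of grayscale values, the similarity function is a stand-in (mean absolute pixel difference) rather than the metric the paper uses, and scoring a failed render as 0.0 is likewise an assumption.

```python
from typing import Callable, List

# Toy image representation (assumption): grayscale pixels in 0..255, row-major.
Image = List[List[int]]


def pixel_similarity(a: Image, b: Image) -> float:
    """Stand-in metric: 1.0 for identical images, 0.0 for maximally different."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same shape")
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    count = sum(len(row) for row in a)
    return 1.0 - total / (255 * count) if count else 1.0


def round_trip_score(
    input_image: Image,
    generate_structure: Callable[[Image], str],  # the VLM under test
    render: Callable[[str], Image],              # e.g. a LaTeX or HTML renderer
    similarity: Callable[[Image, Image], float] = pixel_similarity,
) -> float:
    """Image -> structure -> rendered image -> similarity to the input image."""
    structure = generate_structure(input_image)
    try:
        output_image = render(structure)
    except Exception:
        # Assumption: a structure that fails to render gets the lowest score.
        return 0.0
    return similarity(input_image, output_image)


def toy_render(structure: str) -> Image:
    """Hypothetical renderer: '#' is a black pixel, anything else is white."""
    return [[0 if ch == "#" else 255 for ch in line] for line in structure.splitlines()]


def toy_model(image: Image) -> str:
    """Hypothetical 'model' that thresholds each pixel into '#' or '.'."""
    return "\n".join("".join("#" if p < 128 else "." for p in row) for row in image)


if __name__ == "__main__":
    image = [[0, 255], [255, 0]]
    print(round_trip_score(image, toy_model, toy_render))             # 1.0
    # A different but equally valid structure renders to the same image.
    print(round_trip_score(image, lambda _: "#x\nx#", toy_render))    # 1.0
    print(round_trip_score(image, lambda _: "..\n..", toy_render))    # 0.5
```

Because the score is computed on rendered images rather than on the generated text, the two structures `"#.\n.#"` and `"#x\nx#"` receive the same score here, which is the property that lets this style of evaluation handle tasks with multiple valid structures.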

Code & Models

Repositories

stanford-crfm/helm (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques