GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
Yaohan Guan, Pristina Wang, Najim Dehak, Alan Yuille, Jieneng Chen, Daniel Khashabi

TL;DR
GENFIG1 is a new benchmark for evaluating AI models' ability to generate scientifically meaningful and visually coherent figures that summarize research ideas from scholarly papers.
Contribution
The paper introduces GENFIG1, a challenging benchmark for vision-language models to generate scientific figures that accurately and effectively communicate core research concepts.
Findings
Models struggle to generate accurate scientific figures.
The benchmark correlates well with expert human judgment.
Current models show significant room for improvement.
Abstract
In many scientific papers, "Figure 1" serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, often requiring significant effort and iteration by human authors to get right, which highlights the difficulty of scientific visual communication. With this intuition, we introduce GENFIG1, a benchmark for generative AI models (e.g., Vision-Language Models). GENFIG1 evaluates models on their ability to produce figures that clearly express and motivate the central idea of a paper, given its title, abstract, introduction, and figure caption as input. Solving GENFIG1 requires more than producing visually appealing graphics: the task entails reasoning for text-to-image generation that couples scientific understanding with visual synthesis. Specifically, models must (i) comprehend the technical concepts of the paper, (ii) identify the most…
