GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models

Yaohan Guan; Pristina Wang; Najim Dehak; Alan Yuille; Jieneng Chen; Daniel Khashabi

arXiv:2604.04172 · cs.CV · April 7, 2026

TL;DR

GENFIG1 is a new benchmark for evaluating AI models' ability to generate scientifically meaningful and visually coherent figures that summarize research ideas from scholarly papers.

Contribution

The paper introduces GENFIG1, a challenging benchmark that tests whether vision-language models can generate scientific figures that accurately and effectively communicate a paper's core research concepts.

Findings

01

Models struggle to generate accurate scientific figures.

02

The benchmark correlates well with expert human judgment.

03

Current models show significant room for improvement.
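
One common way to check a claim like finding 02 is to compute a rank correlation between the scores an automatic metric assigns to generated figures and the ratings experts give the same figures. The summary here does not say which statistic the authors use, so the sketch below assumes Spearman's rank correlation. The function names and the score lists are hypothetical and made up purely for illustration; they are not results from the paper.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical toy data (NOT from the paper): one automatic score and one
# expert rating per generated figure.
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
human_ratings = [1, 3, 2, 4, 5]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the metric orders figures much as the experts do, while a value near 0 means the two orderings are unrelated. Rank correlation is used in this sketch because it only compares orderings, so the metric and the human ratings do not need to share a scale.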

Abstract

In many scientific papers, "Figure 1" serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, and they often require significant effort and iteration by human authors to get right, which highlights the difficulty of scientific visual communication. With this intuition, we introduce GENFIG1, a benchmark for generative AI models (e.g., Vision-Language Models). Given a paper's title, abstract, introduction, and figure caption as input, GENFIG1 evaluates models on their ability to produce figures that clearly express and motivate the paper's central idea. Solving GENFIG1 requires more than producing visually appealing graphics: the task entails reasoning for text-to-image generation that couples scientific understanding with visual synthesis. Specifically, models must (i) comprehend the technical concepts of the paper, (ii) identify the most…
