Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models

Hyunsik Chae; Seungwoo Yoon; Jaden Park; Chloe Yewon Chun; Yongin Cho; Mu Cai; Yong Jae Lee; Ernest K. Ryu

arXiv:2505.20021 · cs.CV · May 27, 2025



TL;DR

This paper identifies the fundamental visual perception skills in basic 2D Euclidean geometry, introduces a dataset to evaluate them, and shows that current vision-language models struggle with these basic tasks, which points to a need for purpose-built datasets for training and evaluation.

Contribution

The paper introduces the Atomic Visual Skills Dataset (AVSD) and systematically categorizes atomic visual skills for evaluating VLMs.

Findings

01. VLMs struggle with atomic visual skills, even though these tasks are simple for humans.

02. Current VLMs underperform on basic geometric tasks.

03. The results highlight the need for datasets focused on atomic visual perception.

Abstract

Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on these atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, even though the tasks are trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.


Taxonomy

Topics: Robotics and Automated Systems · Machine Learning in Materials Science

Methods: Focus