Evaluating the encoding competence of visual language models using uncommon actions
Chen Ling, Nai Ding

TL;DR
This paper introduces UAIT, a benchmark dataset for evaluating visual language models' understanding of uncommon-sense actions, revealing their limitations and potential for improvement in semantic reasoning.
Contribution
The paper presents UAIT, a novel dataset for testing VLMs on uncommon-sense actions, and provides insights into their semantic reasoning weaknesses and adaptation potential.
Findings
Models perform worse than humans on semantic judgment tasks.
Fine-tuning improves model accuracy in uncommon-sense reasoning.
The dataset reveals key weaknesses in current VLMs' understanding.
Abstract
We propose the UAIT (Uncommon-sense Action Image-Text) dataset, a new evaluation benchmark designed to test the semantic understanding of visual language models (VLMs) in uncommon-sense action scenes. Unlike previous datasets, which focus on common visual scenes that benefit from statistical frequency, UAIT challenges models with image-text pairs that are grammatically well-formed but semantically contrary to common sense. Such tasks require models to go beyond superficial pattern recognition and demonstrate a deep understanding of agent-patient relationships and physical feasibility. To build UAIT, we designed a semi-automated process that synthesizes high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation. Each sample is accompanied by a carefully designed multiple-choice question to test the model's competence in…
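As a rough illustration of how a multiple-choice benchmark of this kind could be scored, the sketch below computes accuracy over a list of items. It is not the authors' evaluation code: the `Item` fields, the file names, the example captions, and the `always_first` baseline are all hypothetical and chosen only to show an agent-patient swap.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Item:
    """One hypothetical benchmark item; the field names are assumptions."""
    image_path: str
    options: tuple[str, ...]
    answer_index: int  # index of the caption that matches the image


def accuracy(items: Sequence[Item], predict: Callable[[Item], int]) -> float:
    """Fraction of items for which `predict` returns the correct option index."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if predict(item) == item.answer_index)
    return correct / len(items)


# Two made-up items. In each, the options differ only in which entity is the
# agent and which is the patient; the uncommon reading is the correct one.
items = [
    Item("sample_001.png", ("The mouse chases the cat.", "The cat chases the mouse."), 0),
    Item("sample_002.png", ("The man bites the dog.", "The dog bites the man."), 1),
]


def always_first(item: Item) -> int:
    """Trivial baseline standing in for a real model: always pick option 0."""
    return 0


baseline_accuracy = accuracy(items, always_first)
```

A real evaluation would replace `always_first` with a function that sends the image and the options to a VLM and parses the chosen option from its reply.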
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
