Evaluating the encoding competence of visual language models using uncommon actions

Chen Ling; Nai Ding

arXiv:2601.07737·cs.CV·January 13, 2026

Open Access

TL;DR

This paper introduces UAIT, a benchmark dataset for evaluating how well visual language models understand uncommon-sense actions. It reveals the models' limitations in semantic reasoning and their potential for improvement.

Contribution

The paper presents UAIT, a novel dataset for testing VLMs on uncommon-sense actions, and provides insights into their semantic reasoning weaknesses and adaptation potential.

Findings

01. Models perform worse than humans on semantic judgment tasks.
02. Fine-tuning improves model accuracy in uncommon-sense reasoning.
03. The dataset reveals key weaknesses in current VLMs' understanding.

Abstract

We propose the UAIT (Uncommon-sense Action Image-Text) dataset, a new evaluation benchmark designed to test the semantic understanding of visual language models (VLMs) in uncommon-sense action scenes. Unlike previous datasets, which focus on common visual scenes that benefit from high statistical frequency, UAIT challenges models with image-text pairs that are grammatically reasonable but semantically contrary to common sense. Such tasks require models to go beyond superficial pattern recognition and demonstrate a deep understanding of agent-patient relationships and physical feasibility. To build UAIT, we designed a semi-automated process that synthesizes high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation. Each sample is accompanied by a carefully designed multiple-choice question to test the model's competence in…
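The multiple-choice evaluation described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' code: the `Sample` fields, the toy items, and the `always_first` baseline are all hypothetical, since the abstract does not give the dataset schema or the scoring procedure.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Sample:
    """One image-text item. Field names are assumptions, not the UAIT schema."""
    image_path: str           # path to the generated image
    question: str             # multiple-choice question about the image
    options: Sequence[str]    # candidate answers
    answer_index: int         # index of the option that matches the image


def accuracy(samples: Sequence[Sample],
             predict: Callable[[Sample], int]) -> float:
    """Return the fraction of samples where predict() picks the correct option."""
    if not samples:
        raise ValueError("no samples to score")
    correct = sum(1 for s in samples if predict(s) == s.answer_index)
    return correct / len(samples)


# Toy items invented for illustration; they are not taken from the dataset.
samples = [
    Sample("img_001.png", "Who is chasing whom?",
           ["The cat chases the mouse", "The mouse chases the cat"], 1),
    Sample("img_002.png", "Who is feeding whom?",
           ["The baby feeds the mother", "The mother feeds the baby"], 0),
]


def always_first(sample: Sample) -> int:
    """Stand-in for a model: ignores the image and picks the first option."""
    return 0


print(accuracy(samples, always_first))
```

In practice, `predict` would wrap a call to a VLM that receives the image and the question. A predictor that ignores the image, like the stand-in here, cannot reliably tell which participant is the agent and which is the patient.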

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning