Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models

Yuchen Zhou; Jiayu Tang; Shuo Yang; Xiaoyan Xiao; Yuqin Dai; Wenhao Yang; Chao Gou; Xiaobo Xia; Tat-Seng Chua

arXiv:2508.11317 · cs.CV · August 18, 2025

TL;DR

This paper introduces LogicBench, a large benchmark for evaluating the logical reasoning abilities of vision-language models, and proposes LogicCLIP, a training framework that significantly improves their logical understanding without sacrificing general performance.

Contribution

The paper presents a new benchmark for logical reasoning in vision-language models and a novel training method that enhances logical sensitivity while maintaining overall performance.

Findings

1. Existing VLMs, including state-of-the-art ones, score more than 40 accuracy points below humans on logical tasks.

2. LogicCLIP improves logical reasoning across multiple domains.

3. Enhanced logical understanding does not reduce general performance.

Abstract

Vision-Language Models (VLMs), exemplified by CLIP, have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored, resulting in critical "logical blindspots" that limit their reliability in practical applications. To systematically diagnose this, we introduce LogicBench, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. Our evaluation reveals that existing VLMs, even state-of-the-art ones, fall more than 40 accuracy points below human performance, particularly on challenging tasks such as Causality and Conditionality, highlighting their reliance on surface semantics over critical logical structures. To bridge this gap, we propose LogicCLIP, a novel training framework designed to…
