TL;DR
This paper introduces LogicBench, a large benchmark for evaluating the logical reasoning abilities of vision-language models, and proposes LogicCLIP, a training framework that significantly improves their logical understanding without sacrificing general performance.
Contribution
The paper presents a new benchmark for logical reasoning in vision-language models and a novel training method that enhances logical sensitivity while maintaining overall performance.
Findings
VLMs score more than 40 accuracy points below humans on logical tasks
LogicCLIP improves logical reasoning across multiple domains
Enhanced logical understanding does not reduce general performance
Abstract
Vision-Language Models (VLMs), exemplified by CLIP, have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored, resulting in critical "logical blindspots" that limit their reliability in practical applications. To systematically diagnose this, we introduce LogicBench, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. Our evaluation reveals that existing VLMs, even state-of-the-art ones, fall more than 40 accuracy points below human performance, particularly in challenging tasks like Causality and Conditionality, highlighting their reliance on surface semantics over critical logical structures. To bridge this gap, we propose LogicCLIP, a novel training framework designed to…