Can Argus Judge Them All? Comparing VLMs Across Domains

Harsh Joshi; Gautam Siddharth Kashyap; Rafiq Ali; Ebad Shabbir; Niharika Jain; Sarthak Jain; Jiechao Gao; Usman Naseem

arXiv:2507.01042 · cs.IR · July 3, 2025

Open Access

TL;DR

This paper benchmarks CLIP, BLIP, and LXMERT on retrieval, captioning, and reasoning datasets, exposing trade-offs between generalization and specialization that can guide future development and deployment.

Contribution

It introduces a comprehensive benchmarking framework including a novel Cross-Dataset Consistency metric to evaluate VLMs across multiple domains and tasks.

Findings

1. CLIP shows strongest generalization (CDC: 0.92)
2. BLIP excels on curated data
3. LXMERT leads in structured reasoning

Abstract

Vision-Language Models (VLMs) are advancing multimodal AI, yet their performance consistency across tasks is underexamined. We benchmark CLIP, BLIP, and LXMERT across diverse datasets spanning retrieval, captioning, and reasoning. Our evaluation includes task accuracy, generation quality, efficiency, and a novel Cross-Dataset Consistency (CDC) metric. CLIP shows strongest generalization (CDC: 0.92), BLIP excels on curated data, and LXMERT leads in structured reasoning. These results expose trade-offs between generalization and specialization, informing industrial deployment of VLMs and guiding development toward robust, task-flexible architectures.
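
The abstract names the Cross-Dataset Consistency (CDC) metric but does not define it. As a purely illustrative sketch, and not the authors' formula, one way a consistency score of this kind could be computed is one minus the mean absolute relative deviation of a model's per-dataset scores, so that identical scores on every dataset give 1.0. The function name, the dataset keys, and the input numbers below are all hypothetical.

```python
from statistics import mean


def cross_dataset_consistency(scores):
    """Hypothetical consistency score for one model (NOT the paper's definition).

    `scores` maps a dataset name to that model's task score on it.
    Returns 1 - (mean absolute deviation from the mean / mean), so a model
    that scores the same on every dataset gets 1.0.
    """
    values = list(scores.values())
    if not values:
        raise ValueError("at least one dataset score is required")
    average = mean(values)
    if average == 0:
        raise ValueError("mean score is zero; relative deviation is undefined")
    relative_deviation = mean(abs(v - average) for v in values) / average
    return 1.0 - relative_deviation


# Made-up scores for illustration only; these are not results from the paper.
example_scores = {"dataset_a": 0.5, "dataset_b": 1.0}
example_cdc = cross_dataset_consistency(example_scores)
```

Under this assumed definition, a score near 1.0 means performance barely moves between datasets, which matches the abstract's reading of a high CDC as strong generalization. Other reasonable definitions, for example ones based on variance or on the worst-case gap, would rank models differently.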

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling