Evaluating Cell Type Inference in Vision Language Models Under Varying Visual Context

Samarth Singhal; Sandeep Singhal

arXiv:2506.12683 · cs.CV · June 17, 2025

TL;DR

This paper evaluates prominent vision-language models, such as GPT-4.1 and Gemini 2.5 Pro, on histopathology cell typing tasks. One-shot prompting improves their performance over zero-shot prompting, but they still trail specialized CNNs on most tasks.

Contribution

It assesses general-purpose VLMs on histopathology image classification under zero-shot and one-shot prompting, compares them against custom-trained CNNs, and documents both their potential and their current shortcomings.

Findings

01 One-shot prompting significantly improves VLM performance over zero-shot prompting.

02 VLMs underperform supervised CNNs on most tasks.

03 VLMs show promise in specialized domains such as pathology, but their current limitations remain.

Abstract

Vision-Language Models (VLMs) have rapidly advanced alongside Large Language Models (LLMs). This study evaluates the capabilities of prominent generative VLMs, such as GPT-4.1 and Gemini 2.5 Pro, accessed via APIs, for histopathology image classification tasks, including cell typing. Using diverse datasets from public and private sources, we apply zero-shot and one-shot prompting methods to assess VLM performance, comparing them against custom-trained Convolutional Neural Networks (CNNs). Our findings demonstrate that while one-shot prompting significantly improves VLM performance over zero-shot ($p \approx 1.005 \times 10^{-5}$ based on Kappa scores), these general-purpose VLMs currently underperform supervised CNNs on most tasks. This work underscores both the promise and limitations of applying current VLMs to specialized domains like pathology via in-context learning. All code and…
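The abstract compares prompting methods by their Kappa scores. As a rough illustration of that metric only, the sketch below computes Cohen's kappa for two sets of predictions against the same ground truth. The function name `cohen_kappa`, the cell-type labels, and the toy predictions are all hypothetical and do not come from the paper; the statistical test behind the reported p-value is not specified in this excerpt and is not reproduced here.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of items where the two labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement expected from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Both sequences use one identical label; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data, not taken from the paper's datasets.
truth = ["tumor", "tumor", "stroma", "stroma", "immune", "immune"]
zero_shot = ["tumor", "stroma", "stroma", "tumor", "immune", "tumor"]
one_shot = ["tumor", "tumor", "stroma", "stroma", "immune", "tumor"]

kappa_zero = cohen_kappa(truth, zero_shot)  # about 0.25 on this toy data
kappa_one = cohen_kappa(truth, one_shot)    # about 0.75 on this toy data
print(f"zero-shot kappa: {kappa_zero:.2f}, one-shot kappa: {kappa_one:.2f}")
```

In a study like this one, a kappa value would presumably be computed per task and per model for each prompting method, and the paired values would then be fed to a significance test.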

Code & Models

Repositories

a12dongithub/vlmcce (PyTorch · Official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer · GPT-4