Evaluating Cell Type Inference in Vision Language Models Under Varying Visual Context
Samarth Singhal, Sandeep Singhal

TL;DR
This paper evaluates the performance of prominent vision-language models like GPT-4.1 and Gemini 2.5 Pro in histopathology cell typing tasks, revealing their current limitations compared to specialized CNNs despite improvements with one-shot prompting.
Contribution
It provides a comprehensive assessment of VLMs in a specialized domain, highlighting their potential and current shortcomings in histopathology image classification.
Findings
One-shot prompting improves VLM performance significantly.
VLMs underperform compared to supervised CNNs on most tasks.
VLMs show promise but have limitations in specialized domains.
Abstract
Vision-Language Models (VLMs) have rapidly advanced alongside Large Language Models (LLMs). This study evaluates the capabilities of prominent generative VLMs, such as GPT-4.1 and Gemini 2.5 Pro, accessed via APIs, for histopathology image classification tasks, including cell typing. Using diverse datasets from public and private sources, we apply zero-shot and one-shot prompting methods to assess VLM performance, comparing them against custom-trained Convolutional Neural Networks (CNNs). Our findings demonstrate that while one-shot prompting significantly improves VLM performance over zero-shot ( based on Kappa scores), these general-purpose VLMs currently underperform supervised CNNs on most tasks. This work underscores both the promise and limitations of applying current VLMs to specialized domains like pathology via in-context learning. All code and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
MethodsDropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer · GPT-4
