Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?

Itay Cohen; Ethan Fetaya; Amir Rosenfeld

arXiv:2511.19200 · cs.CV · November 26, 2025

Open Access

TL;DR

This paper investigates whether modern vision-language models like CLIP can distinguish between real objects and look-alikes, introducing a dataset and methods to evaluate and improve this subtle perceptual ability.

Contribution

The authors curate the RoLA (Real or Lookalike) dataset and estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves how well the model separates real objects from look-alikes.

Findings

01. Improved cross-modal retrieval accuracy on Conceptual12M.

02. Enhanced captioning quality with the proposed embedding direction.

03. Demonstrated that CLIP can be fine-tuned to recognize look-alikes.

Abstract

Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix…
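
The idea of a real-to-lookalike direction in embedding space can be illustrated with a minimal sketch. The abstract does not say how the direction is estimated, so the difference-of-class-means estimator below is an assumption, not the authors' method. The embeddings are synthetic stand-ins rather than actual CLIP outputs, and the names `estimate_direction`, `shift`, and `alpha` are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length, as is customary for CLIP embeddings."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def estimate_direction(real_emb, lookalike_emb):
    """Assumed estimator: unit vector from the real mean to the lookalike mean."""
    d = lookalike_emb.mean(axis=0) - real_emb.mean(axis=0)
    return d / np.linalg.norm(d)

def shift(emb, direction, alpha):
    """Move embeddings along the direction by alpha, then re-normalize."""
    return l2_normalize(emb + alpha * direction)

# Synthetic stand-ins for embeddings: Gaussian noise plus a class-dependent
# offset along one hidden axis (purely illustrative, not RoLA data).
rng = np.random.default_rng(0)
dim, n = 64, 100
hidden_axis = np.zeros(dim)
hidden_axis[0] = 1.0
real = l2_normalize(rng.normal(size=(n, dim)) - 3.0 * hidden_axis)
lookalike = l2_normalize(rng.normal(size=(n, dim)) + 3.0 * hidden_axis)

direction = estimate_direction(real, lookalike)

# Classify by projecting onto the direction and thresholding at the
# midpoint between the two projected class means.
proj_real = real @ direction
proj_look = lookalike @ direction
threshold = 0.5 * (proj_real.mean() + proj_look.mean())
predictions = np.concatenate([proj_real, proj_look]) > threshold
labels = np.concatenate([np.zeros(n, dtype=bool), np.ones(n, dtype=bool)])
accuracy = (predictions == labels).mean()

# Shifting real embeddings toward "lookalike" raises their projection.
shifted_real = shift(real, direction, alpha=1.0)
proj_shifted = shifted_real @ direction

print(f"accuracy on synthetic data: {accuracy:.2f}")
print(f"mean projection before/after shift: "
      f"{proj_real.mean():.2f} / {proj_shifted.mean():.2f}")
```

With real CLIP features, `real` and `lookalike` would be replaced by normalized image (or text) embeddings of labeled exemplars. The same `shift` call could then be applied to either modality before computing retrieval similarities, which is one plausible reading of "applying this direction to image and text embeddings".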

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning