Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?
Itay Cohen, Ethan Fetaya, Amir Rosenfeld

TL;DR
This paper investigates whether modern vision-language models like CLIP can distinguish between real objects and look-alikes, introducing a dataset and methods to evaluate and improve this subtle perceptual ability.
Contribution
The authors curate the RoLA (Real or Lookalike) dataset and estimate a direction in CLIP's embedding space that moves representations between real and look-alike, which helps the model tell real objects apart from look-alikes.
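To make the idea of a direction in embedding space concrete, here is a minimal sketch. It assumes the direction is taken as the normalized difference between the mean real embedding and the mean look-alike embedding, which is one common choice and not necessarily the authors' estimator. The function names, the `alpha` step size, and the 3-dimensional toy vectors are all hypothetical stand-ins for actual CLIP embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def estimate_direction(real_embs, lookalike_embs):
    """Assumed estimator: unit vector pointing from the look-alike mean to the real mean."""
    mu_real = mean_vector([normalize(v) for v in real_embs])
    mu_look = mean_vector([normalize(v) for v in lookalike_embs])
    return normalize([r - l for r, l in zip(mu_real, mu_look)])

def shift(embedding, direction, alpha):
    """Move a unit-normalized embedding by alpha along the direction, then renormalize."""
    e = normalize(embedding)
    return normalize([x + alpha * d for x, d in zip(e, direction)])

# Toy 3-d stand-ins for CLIP embeddings (made-up numbers, not real data).
real_embs = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.1]]
lookalike_embs = [[0.2, 1.0, 0.0], [0.1, 0.9, 0.1]]

direction = estimate_direction(real_embs, lookalike_embs)
query = normalize([0.5, 0.5, 0.1])

# Positive alpha pushes toward "real", negative alpha toward "look-alike".
toward_real = shift(query, direction, alpha=0.5)
toward_lookalike = shift(query, direction, alpha=-0.5)
```

In this sketch the same `shift` call could be applied to either an image or a text embedding, since both live in the shared space; how large a step to take is left open here.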
Findings
Improved real-versus-lookalike discrimination in cross-modal retrieval on Conceptual12M.
Enhanced captioning quality with the proposed embedding direction.
Demonstrated that CLIP can be fine-tuned to recognize look-alikes.
Abstract
Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix…
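The prompt-based baseline mentioned in the abstract can be illustrated with a small sketch: an image embedding is compared against the text embeddings of a "real" prompt and a "lookalike" prompt, and the more similar prompt wins. The prompt wording, the `scale` value, and the toy vectors below are assumptions for illustration only; the source does not give the exact templates, and in practice the vectors would come from CLIP's image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify(image_emb, prompt_embs, scale=100.0):
    """Return (best_label, probabilities) from a softmax over scaled cosine similarities."""
    logits = {label: scale * cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs

# Hypothetical paired prompts for one category (wording is an assumption).
prompts = {
    "real": "a photo of a real dog",
    "lookalike": "a photo of something that looks like a dog",
}

# Toy vectors standing in for the encoded prompts and an encoded image.
prompt_embs = {"real": [0.9, 0.1, 0.2], "lookalike": [0.2, 0.9, 0.1]}
image_emb = [0.3, 0.8, 0.1]  # e.g. an image of a toy dog

label, probs = classify(image_emb, prompt_embs)
```

With these toy numbers the image sits closer to the "lookalike" prompt, so that label is chosen.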
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
