Benchmarking Attribute Discrimination in Infant-Scale Vision-Language Models

Patrick Batsell; Satoshi Tsutsui; Bihan Wen

arXiv:2512.18951·cs.LG·May 14, 2026

TL;DR

This paper introduces a benchmark that tests whether infant-scale vision-language models can discriminate visual attributes (color, size, and texture), and it finds that the models' visual and linguistic grounding of those attributes dissociate.

Contribution

The paper presents a controlled benchmark for attribute discrimination and compares infant-trained models with web-scale models, highlighting differences in how each grounds attributes visually and linguistically.

Findings

01. Infant-trained models excel at size discrimination but struggle with color.

02. Web-trained models strongly ground color from text but are weaker in size discrimination.

03. Models show a dissociation between visual and linguistic attribute representations.

Abstract

Infants learn not only object categories but also fine-grained visual attributes such as color, size, and texture from limited experience. Prior infant-scale vision-language models have mainly been evaluated on object recognition, leaving open whether they support within-class attribute discrimination. We introduce a controlled benchmark that varies color, size, and texture across 67 everyday object classes using synthetic rendering to decouple attribute values from object identity. We evaluate infant-trained models (CVCL and an infant-trained DINO baseline) against web-scale and ImageNet models (CLIP, SigLIP, ResNeXt) under two complementary settings: an image-only prototype test and a text-vision test with attribute-object prompts. We find a dissociation between visual and linguistic attribute information: infant-trained models form strong visual representations for size and…
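
The two evaluation settings named in the abstract can be illustrated with a minimal sketch. It is not the authors' protocol, which this page does not specify. It assumes that both tests reduce to nearest-neighbor matching by cosine similarity, with candidates that are either image prototypes (mean embeddings per attribute value) or prompt embeddings. The function names, the toy 2-D vectors, the "red"/"blue" labels, and the example prompt are all hypothetical stand-ins for real encoder outputs.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def build_prototypes(embeddings, labels):
    """Average the unit-normalized embeddings that share an attribute value."""
    groups = {}
    for emb, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(normalize(emb))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }

def nearest(query, candidates):
    """Return the key of the candidate vector most similar to the query."""
    return max(candidates, key=lambda k: cosine(query, candidates[k]))

def accuracy(queries, targets, candidates):
    """Fraction of queries whose nearest candidate matches the target label."""
    hits = sum(nearest(q, candidates) == t for q, t in zip(queries, targets))
    return hits / len(queries)

# Toy 2-D embeddings standing in for real encoder outputs (hypothetical values).
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = ["red", "red", "blue", "blue"]
queries = [[0.8, 0.2], [0.2, 0.8]]
targets = ["red", "blue"]

# Image-only setting (assumed form): candidates are prototypes of image embeddings.
image_prototypes = build_prototypes(support, support_labels)
print("prototype test accuracy:", accuracy(queries, targets, image_prototypes))

# Text-vision setting (assumed form): candidates are embeddings of
# attribute-object prompts, e.g. a hypothetical "a red ball" vs. "a blue ball".
prompt_embeddings = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}
print("text-vision test accuracy:", accuracy(queries, targets, prompt_embeddings))
```

Under these assumptions, the same image embeddings can score well against image prototypes and poorly against prompt embeddings, or the reverse. That is one way the reported dissociation between visual and linguistic attribute information could show up.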
