Learning complete and explainable visual representations from itemized text supervision

Yiwei Lyu; Chenhui Zhao; Soumyanil Banerjee; Shixuan Liu; Akshay Rao; Akhil Kondepudi; Honglak Lee; Todd C. Hollon

arXiv:2512.11141 · cs.CV · March 18, 2026

TL;DR

This paper introduces ItemizedCLIP, a framework that learns complete, explainable, and semantically grounded visual representations from itemized text supervision, improving both zero-shot performance and interpretability across multiple domains.

Contribution

The paper proposes ItemizedCLIP, a method that uses itemized text annotations to produce complete and interpretable visual embeddings. It targets non-object-centric domains such as medical imaging and remote sensing, where a single image carries several independent text findings rather than redundant captions.

Findings

01 Significant zero-shot performance improvements across multiple domains.

02 Enhanced interpretability and item-differentiability of visual representations.

03 Effective modeling of item independence and coverage in visual embeddings.

Abstract

Training vision models with language supervision enables general and transferable representations. However, many visual domains, especially non-object-centric domains such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. Such supervision differs from standard multi-caption supervision, where captions are redundant or highly overlapping. Here, we introduce ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. ItemizedCLIP employs a cross-attention module to produce text item-conditioned visual embeddings and a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). Across four domains with…
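
The cross-attention step described in the abstract can be sketched roughly as follows. This is a minimal single-head illustration in NumPy, not the authors' implementation: the function name `item_conditioned_embeddings`, the choice of text items as queries over patch tokens, the absence of learned projections, and the array shapes are all assumptions made for illustration. Each text item embedding attends over the image's patch tokens, yielding one visual embedding per item plus an attention map over patches.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def item_conditioned_embeddings(patches, items):
    """Hypothetical single-head cross-attention (assumed design, no learned weights).

    patches: (P, D) array of visual patch tokens for one image.
    items:   (K, D) array of text item embeddings for the same image.
    Returns a (K, D) array of item-conditioned visual embeddings and
    a (K, P) array of attention weights, one row per text item.
    """
    d = patches.shape[-1]
    # Text items act as queries; patch tokens act as keys and values.
    scores = items @ patches.T / np.sqrt(d)   # (K, P)
    attn = softmax(scores, axis=-1)           # each row sums to 1
    pooled = attn @ patches                   # (K, D)
    return pooled, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(16, 8))   # 16 patches, 8-dim features (toy sizes)
    items = rng.normal(size=(3, 8))      # 3 text items for this image
    emb, attn = item_conditioned_embeddings(patches, items)
    print(emb.shape, attn.shape)
```

Each row of the returned attention array can be reshaped to the patch grid to inspect which regions a given text item attends to, which is one way to read the "distinct regions for distinct items" property the abstract mentions.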

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling