Learning Visual Composition through Improved Semantic Guidance

Austin Stone; Hagen Soltau; Robert Geirhos; Xi Yi; Ye Xia; Bingyi Cao; Kaifeng Chen; Abhijit Ogale; Jonathon Shlens

arXiv:2412.15396·cs.CV·April 7, 2025

Open Access

TL;DR

This paper shows that improving weakly labeled data like captions can significantly enhance contrastive learning models' ability to understand visual composition, surpassing specialized architectures.

Contribution

The study demonstrates that simple data enhancement techniques can vastly improve contrastive learning models' compositional understanding without complex architectures.

Findings

01. Enhanced caption data boosts CLIP performance on compositional tasks.

02. Standard CLIP with improved data outperforms bespoke architectures.

03. Strong results are reported on a new captioning benchmark derived from DOCCI.

Abstract

Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning -- where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation through the development of bespoke learned architectures to directly address the shortcomings in compositional learning. In this work, we focus on simple and scalable approaches. In particular, we demonstrate that by substantially improving weakly labeled data, i.e., captions, we can vastly improve the…
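The "bag of words" limitation mentioned in the abstract can be illustrated with a small text-only analogy. The sketch below is not the paper's model or evaluation; the captions and the helper names `bag_of_words` and `bow_cosine` are hypothetical and chosen only for illustration. It shows that a representation which discards word order cannot tell apart two captions describing opposite interactions between the same objects.

```python
import math
from collections import Counter


def bag_of_words(caption: str) -> Counter:
    """Count lowercase, whitespace-separated tokens; word order is discarded."""
    return Counter(caption.lower().split())


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words count vectors of two captions."""
    ca, cb = bag_of_words(a), bag_of_words(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty caption has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical captions: same words, opposite relation between the objects.
caption = "a dog chases a cat"
swapped = "a cat chases a dog"
unrelated = "a bird sits on a fence"

# The two compositions collapse to the same bag of words.
same_bag = bag_of_words(caption) == bag_of_words(swapped)
sim_swapped = bow_cosine(caption, swapped)
sim_unrelated = bow_cosine(caption, unrelated)

print(same_bag, sim_swapped, sim_unrelated)
```

Under this order-free representation the swapped caption is indistinguishable from the original, while a caption about different objects is not. A model that behaves this way can recognize which objects are present but not how they interact, which is the compositional gap the abstract describes.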

Taxonomy

Topics: Visual and Cognitive Learning Processes

Methods: Contrastive Learning · Contrastive Language-Image Pre-training · Focus