Learning Visual Composition through Improved Semantic Guidance
Austin Stone, Hagen Soltau, Robert Geirhos, Xi Yi, Ye Xia, Bingyi Cao, Kaifeng Chen, Abhijit Ogale, Jonathon Shlens

TL;DR
This paper shows that improving weakly labeled data like captions can significantly enhance contrastive learning models' ability to understand visual composition, surpassing specialized architectures.
Contribution
The study demonstrates that simple data enhancement techniques can vastly improve contrastive learning models' compositional understanding without complex architectures.
Findings
Enhanced caption data boosts CLIP performance on compositional tasks.
Standard CLIP with improved data outperforms bespoke architectures.
The same approach gives strong results on a new captioning benchmark from DOCCI.
Abstract
Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning -- where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation through the development of bespoke learned architectures to directly address the shortcomings in compositional learning. In this work, we focus on simple, and scalable approaches. In particular, we demonstrate that by substantially improving weakly labeled data, i.e. captions, we can vastly improve the…
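As a toy illustration of the bag-of-words limitation the abstract describes (this is not the paper's method or benchmark), the sketch below shows that two captions with opposite meanings become indistinguishable once word order is discarded. The helper name `bag_of_words` and both example captions are assumptions made up for this illustration.

```python
from collections import Counter


def bag_of_words(caption: str) -> Counter:
    """Count lowercase whitespace-separated tokens, discarding word order."""
    return Counter(caption.lower().split())


# Hypothetical captions: same words, different composition.
caption_a = "a dog chasing a cat"
caption_b = "a cat chasing a dog"

# An order-blind representation cannot tell the two captions apart,
# even though the strings (and the scenes they describe) differ.
same_bag = bag_of_words(caption_a) == bag_of_words(caption_b)
same_text = caption_a == caption_b
print(same_bag, same_text)  # True False
```

A model whose text or image representation behaves like `bag_of_words` would score both captions identically against an image of either scene, which is the failure mode that compositional benchmarks are designed to expose.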
Taxonomy
Topics: Visual and Cognitive Learning Processes
Methods: Contrastive Learning · Contrastive Language-Image Pre-training · Focus
