Semantic Composition in Visually Grounded Language Models
Rohan Pandey

TL;DR
This paper investigates how visually grounded language models represent compositional semantics, introduces new benchmarks and methods to measure and improve this ability, and explores connections to cognitive sciences.
Contribution
It introduces novel benchmarks, measures, and techniques to evaluate and enhance compositional semantics in vision-language models.
Findings
Visual question answering benchmark for compositionality
Measures of compositional ability in sentence embeddings
Methods to improve vision-language semantic composition
Abstract
What is sentence meaning and its ideal representation? Much of the expressive power of human language derives from semantic composition, the mind's ability to represent meaning hierarchically and relationally over constituents. At the same time, much sentential meaning is outside the text and requires grounding in sensory, motor, and experiential modalities to be adequately learned. Although large language models display considerable compositional ability, recent work shows that visually grounded language models drastically fail to represent compositional structure. In this thesis, we explore whether and how models compose visually grounded semantics, and how we might improve their ability to do so. Specifically, we introduce 1) WinogroundVQA, a new compositional visual question answering benchmark, 2) Syntactic Neural Module Distillation, a measure of compositional ability in sentence…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
