Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models

Sri Harsha Dumpala; David Arps; Sageev Oore; Laura Kallmeyer; Hassan Sajjad

arXiv:2412.08111 · cs.CV · December 12, 2024

Open Access

TL;DR

This paper investigates how vision-language models encode syntactic information, revealing that their syntactic understanding is primarily influenced by pre-training objectives and varies across model types and layers.

Contribution

It provides a comprehensive analysis of syntactic encoding in VLMs, highlighting the impact of training objectives over architecture or data size, and compares them with uni-modal language models.

Findings

01

ULMs encode syntactic information more effectively than VLMs.

02

Pre-training objectives significantly influence syntactic knowledge acquisition.

03

Layer-wise trends differ across models, with some showing decreased syntactic encoding in deeper layers.

Abstract

Vision-language models (VLMs) serve as foundation models for multi-modal applications such as image captioning and text-to-image generation. Recent studies have highlighted limitations in VLM text encoders, particularly in areas like compositionality and semantic understanding, though the underlying reasons for these limitations remain unclear. In this work, we aim to address this gap by analyzing the syntactic information, one of the fundamental linguistic properties, encoded by the text encoders of VLMs. We perform a thorough analysis comparing VLMs that differ in objective function, parameter size, and training data size, both with one another and with uni-modal language models (ULMs), in their ability to encode syntactic knowledge. Our findings suggest that ULM text encoders acquire syntactic information more effectively than those in VLMs. The syntactic information learned by VLM text encoders is shaped…

Taxonomy

Topics: Speech and dialogue systems · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training