VLGrammar: Grounded Grammar Induction of Vision and Language
Yining Hong, Qing Li, Song-Chun Zhu, Siyuan Huang

TL;DR
VLGrammar introduces a joint learning framework for grounded grammar induction in vision and language using compound PCFGs, demonstrating superior performance and generalizability on a new large-scale dataset.
Contribution
It proposes a novel contrastive learning approach for simultaneous induction of language and image grammars grounded in visual structures.
Findings
Outperforms baselines in image and language grammar induction
Improves image clustering accuracy by 30%
Generalizes well to unseen categories
Abstract
Cognitive grammar suggests that the acquisition of language grammar is grounded within visual structures. While grammar is an essential representation of natural language, it also exists ubiquitously in vision to represent the hierarchical part-whole structure. In this work, we study grounded grammar induction of vision and language in a joint learning framework. Specifically, we present VLGrammar, a method that uses compound probabilistic context-free grammars (compound PCFGs) to induce the language grammar and the image grammar simultaneously. We propose a novel contrastive learning framework to guide the joint learning of both modules. To provide a benchmark for the grounded grammar induction task, we collect a large-scale dataset, \textsc{PartIt}, which contains human-written sentences that describe part-level semantics for 3D objects. Experiments on the \textsc{PartIt} dataset show…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
MethodsContrastive Learning
