Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
Chao Lou, Wenjuan Han, Yuhuan Lin, Zilong Zheng

TL;DR
This paper introduces an unsupervised approach to jointly model visual scene graphs and language dependency structures, creating a new dataset and a contrastive learning framework that improve language grammar induction and phrase grounding.
Contribution
The paper presents the first unsupervised method for bridging visual scene graphs with linguistic dependency trees, along with a new dataset, VLParse, and a contrastive learning model, VLGAE.
Findings
VLGAE outperforms baselines on grammar induction
Visual cues and dependency relationships enhance VL structure accuracy
The dataset VLParse enables future research in joint VL structure learning
Abstract
Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling, comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural languages (e.g., dependency trees) individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. We introduce a new task, more challenging but worthwhile, that targets inducing such a joint VL structure in an unsupervised manner. Our goal is to bridge visual scene graphs and linguistic dependency trees seamlessly. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than relying on labor-intensive labeling from scratch, we propose an automatic alignment procedure to produce coarse structures, followed by human refinement…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning
