Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships

Chao Lou; Wenjuan Han; Yuhuan Lin; Zilong Zheng

arXiv:2203.14260 · cs.CV · June 2, 2022

TL;DR

This paper introduces an unsupervised approach to jointly model visual scene graphs and language dependency structures, creating a new dataset and a contrastive learning framework that improve language grammar induction and phrase grounding.

Contribution

It presents the first unsupervised method for bridging visual scene graphs with linguistic dependency trees, along with a new dataset VLParse and a contrastive learning model VLGAE.

Findings

1. VLGAE outperforms baselines on grammar induction.

2. Visual cues and dependency relationships enhance VL structure accuracy.

3. The VLParse dataset enables future research in joint VL structure learning.

Abstract

Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling, comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural languages (e.g., dependency trees) individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. We introduce a new, more challenging task that aims to induce such a joint VL structure in an unsupervised manner. Our goal is to bridge visual scene graphs and linguistic dependency trees seamlessly. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than relying on labor-intensive labeling from scratch, we propose an automatic alignment procedure that produces coarse structures, followed by human refinement…
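To make the idea of a joint VL structure concrete, here is a minimal sketch of how a dependency tree, a scene graph, and a word-to-object alignment might be held together. It is an illustration only, not the VLParse format or the VLGAE model: the class names, fields, the `shared_heads` helper, and the toy sentence are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DependencyTree:
    tokens: List[str]
    heads: List[int]  # heads[i] = index of token i's head, -1 for the root


@dataclass
class SceneGraph:
    objects: List[str]                      # object node labels
    relations: List[Tuple[int, str, int]]   # (subject idx, predicate, object idx)


@dataclass
class JointVLStructure:
    """Hypothetical container linking the two structures via an alignment."""
    tree: DependencyTree
    graph: SceneGraph
    alignment: Dict[int, int] = field(default_factory=dict)  # token idx -> object idx

    def shared_heads(self) -> List[Tuple[str, str, str, str]]:
        """For each scene-graph relation whose two objects are aligned to
        tokens with the same head, return (subject token, head token,
        object token, predicate)."""
        obj_to_tok = {obj: tok for tok, obj in self.alignment.items()}
        found = []
        for subj, predicate, obj in self.graph.relations:
            ts, to = obj_to_tok.get(subj), obj_to_tok.get(obj)
            if ts is None or to is None:
                continue  # one side of the relation is not grounded in the text
            head = self.tree.heads[ts]
            if head >= 0 and head == self.tree.heads[to]:
                found.append((self.tree.tokens[ts], self.tree.tokens[head],
                              self.tree.tokens[to], predicate))
        return found


# Toy example (invented for illustration): "a dog chases a ball"
example = JointVLStructure(
    tree=DependencyTree(tokens=["a", "dog", "chases", "a", "ball"],
                        heads=[1, 2, -1, 4, 2]),
    graph=SceneGraph(objects=["dog", "ball"],
                     relations=[(0, "chasing", 1)]),
    alignment={1: 0, 4: 1},  # "dog" -> object 0, "ball" -> object 1
)
print(example.shared_heads())
```

In this toy case the verb "chases" heads both aligned nouns, so the dependency arcs and the scene-graph relation describe the same event; an unsupervised parser would have to induce both the tree and the alignment rather than receive them as input.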

Code & Models

Repositories

bigai-research/vlgae (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning