Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations

Yufeng Huang; Jiji Tang; Zhuo Chen; Rongsheng Zhang; Xinfeng Zhang; Weijie Chen; Zeng Zhao; Zhou Zhao; Tangjie Lv; Zhipeng Hu; Wen Zhang

arXiv:2305.06152·cs.CL·December 14, 2023·1 cites

TL;DR

Structure-CLIP introduces scene graph knowledge into vision-language pre-training to improve structured multi-modal representations, significantly boosting performance on scene graph-related tasks while maintaining generalization.

Contribution

The paper proposes an end-to-end framework that integrates scene graph knowledge into CLIP, including semantic negative example construction and a Knowledge-Enhance Encoder, to better capture structured representations.

Findings

01. Achieves state-of-the-art (SOTA) performance on the VG-Attribution and VG-Relation datasets.

02. Significantly improves structured representation quality on MSCOCO.

03. Maintains general multi-modal understanding capabilities.

Abstract

Large-scale vision-language pre-training has achieved strong performance in multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. As illustrated in the paper's case figure (a), the models cannot distinguish between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning representations in multi-modal scenarios. In this paper, we present an end-to-end framework, Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. Firstly, we use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured…
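
The astronaut/horse example in the abstract can be made concrete with a small sketch. This is not the authors' implementation; it is one plausible illustration of a scene-graph-guided negative, assuming a caption has already been parsed into a (subject, relation, object) triple. The names `Triple`, `to_caption`, and `swap_negative` are hypothetical and do not come from the paper.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A single scene-graph edge (hypothetical container, not from the paper)."""
    subject: str
    relation: str
    obj: str


def to_caption(t: Triple) -> str:
    """Render a triple back into a plain-text caption."""
    return f"{t.subject} {t.relation} {t.obj}"


def swap_negative(t: Triple) -> Triple:
    """Build a negative by exchanging subject and object.

    The negative contains exactly the same words as the positive,
    so only the structure (who does what to whom) separates them.
    """
    return Triple(subject=t.obj, relation=t.relation, obj=t.subject)


positive = Triple("an astronaut", "rides", "a horse")
negative = swap_negative(positive)

print(to_caption(positive))
print(to_caption(negative))
```

Because the two captions share the same bag of words, a model that ignores word order and relations would score them identically against an image; that is the failure mode the abstract describes.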

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning