Unified Visual Relationship Detection with Vision and Language Models

Long Zhao; Liangzhe Yuan; Boqing Gong; Yin Cui; Florian Schroff; Ming-Hsuan Yang; Hartwig Adam; Ting Liu

arXiv:2303.08998 · cs.CV · August 22, 2023 · 1 citation

TL;DR

This paper introduces UniVRD, a unified approach leveraging vision and language models to detect visual relationships across multiple datasets, improving performance and generalization in scene understanding tasks.

Contribution

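The paper proposes UniVRD, a novel bottom-up method that unifies visual relationship detection across datasets using vision and language models (VLMs). Aligned image and text embeddings place similar relationships close together, which reconciles inconsistent label taxonomies across datasets.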

Findings

01. Achieves 38.07 mAP on HICO-DET, surpassing the previous best bottom-up HOI detector by 14.26 mAP.

02. The unified detector performs comparably to dataset-specific models in mAP, with further improvements when the model is scaled up.

03. Demonstrates effectiveness on both human-object interaction detection and scene-graph generation.

Abstract

This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, where similar relationships are optimized to be close to each other for semantic unification. Our bottom-up design enables the model to enjoy the benefit of training with both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation…
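The idea of classifying over a union of label spaces with aligned embeddings can be illustrated with a minimal sketch. This is not the paper's implementation: the three-dimensional vectors are hand-made stand-ins for VLM text and image embeddings, and the dataset tags (`hico`, `vg`), the label strings, and the `classify` helper are hypothetical names chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical toy text embeddings keyed by (dataset, label).
# A real system would obtain these from a VLM text encoder.
TEXT_EMBEDDINGS = {
    ("hico", "person ride horse"): [0.90, 0.10, 0.00],
    ("vg", "man riding horse"): [0.85, 0.15, 0.05],
    ("hico", "person feed horse"): [0.10, 0.90, 0.10],
    ("vg", "dog on grass"): [0.00, 0.10, 0.95],
}


def classify(relationship_embedding, text_embeddings):
    """Rank every label in the union label space by cosine similarity."""
    scored = [
        (cosine(relationship_embedding, emb), key)
        for key, emb in text_embeddings.items()
    ]
    return sorted(scored, reverse=True)


# Toy embedding for a detected subject-object pair (assumed, not real data).
ranked = classify([0.88, 0.12, 0.02], TEXT_EMBEDDINGS)
for score, (dataset, label) in ranked:
    print(f"{score:.3f}  {dataset:5s} {label}")
```

In this toy setup the two differently worded "riding" labels from the two datasets score almost identically, so no hand-written mapping between the taxonomies is needed. The unrelated labels score far lower.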

Code & Models

Repositories

google-research/scenic (JAX, official)

Models

fcxfcx/owlv2 (Hugging Face model, ♡ 1)

Videos

Unified Visual Relationship Detection with Vision and Language Models (YouTube)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning