METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection

Yongqi Wang; Xinxiao Wu; Shuo Yang

arXiv:2505.06663 · cs.CV · May 13, 2025

TL;DR

METOR is a unified, query-based framework that jointly models and mutually enhances object detection and relationship classification for open-vocabulary video visual relationship detection, using CLIP and iterative refinement to improve recognition of novel categories.

Contribution

The paper proposes a mutual enhancement framework for open-vocabulary video visual relationship detection, combining a CLIP-based contextual refinement encoding module with iterative refinement to improve accuracy and generalization to novel categories.

Findings

1. Achieves state-of-the-art results on the VidVRD and VidOR datasets.
2. Effectively models the interdependence between objects and relationships.
3. Enhances recognition of novel categories in open-vocabulary scenarios.

Abstract

Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline to first detect objects and then classify relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to…
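The abstract notes that existing methods rely on vision-language models such as CLIP to recognize novel categories. The general idea behind that kind of open-vocabulary classification can be sketched as follows: a visual embedding is compared against text embeddings of candidate category names, and the closest name wins. This is a minimal illustration, not METOR's implementation. The function name `open_vocab_classify`, the toy three-dimensional vectors, and the temperature value are all assumptions made for the example; a real system would obtain the embeddings from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def open_vocab_classify(visual_emb, text_embs, temperature=0.01):
    """Score a visual embedding against text embeddings of category names.

    Hypothetical helper for illustration only. `text_embs` maps a category
    name to its text embedding; the temperature is an assumed value.
    Returns the best-matching name and a softmax distribution over names.
    """
    labels = list(text_embs)
    logits = [cosine(visual_emb, text_embs[name]) / temperature for name in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy stand-ins for text embeddings of category names (assumed values).
TEXT_EMBS = {
    "dog": [0.9, 0.1, 0.0],
    "bicycle": [0.0, 1.0, 0.1],
    "panda": [0.1, 0.0, 0.95],
}

# Toy stand-in for the visual embedding of a detected object region.
region_emb = [0.2, 0.05, 0.9]

label, probs = open_vocab_classify(region_emb, TEXT_EMBS)
print(label)
```

Because categories are represented only by their text embeddings, a new category can be added at inference time by inserting another entry into the dictionary, with no retraining of the classifier in this sketch. The same scoring applies whether the names describe objects or relationships.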

Code & Models

Repositories

wangyongqi558/METOR (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: ADaptive gradient method with the OPTimal convergence rate · Contrastive Language-Image Pre-training