METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection
Yongqi Wang, Xinxiao Wu, Shuo Yang

TL;DR
METOR is a unified, query-based framework that jointly enhances object detection and relationship classification for open-vocabulary video visual relationship detection, using CLIP and iterative refinement to improve recognition of novel categories.
Contribution
The paper proposes a mutual enhancement framework for open-vocabulary video visual relationship detection that combines a CLIP-based contextual refinement encoding module with iterative refinement to improve accuracy and generalization to novel categories.
Findings
Achieves state-of-the-art results on the VidVRD and VidOR datasets.
Effectively models interdependence between objects and relationships.
Enhances recognition of novel categories in open-vocabulary scenarios.
Abstract
Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline to first detect objects and then classify relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to…
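The abstract notes that existing methods rely on vision-language models such as CLIP to identify novel categories. A minimal sketch of that general idea follows, assuming visual and text embeddings already live in a shared space: a region is labeled by comparing its embedding against text embeddings of arbitrary category names. This is not METOR's architecture. The function names, the temperature value, and the three-dimensional toy vectors are all hypothetical placeholders standing in for real CLIP features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_open_vocab(visual_emb, text_embs, temperature=0.01):
    """Score a visual embedding against text embeddings of category names.

    text_embs maps a category name to its embedding. Any name can be added
    at inference time, which is what makes the vocabulary open.
    The temperature is an assumed value, not taken from the paper.
    """
    names = list(text_embs)
    logits = [cosine(visual_emb, text_embs[n]) / temperature for n in names]
    return dict(zip(names, softmax(logits)))


# Toy embeddings (hypothetical): a real system would obtain these from
# a vision-language model's image and text encoders.
text_embs = {
    "dog": [0.9, 0.1, 0.0],
    "bicycle": [0.0, 0.2, 0.9],
    "zebra": [0.1, 0.9, 0.2],  # imagine this category was unseen in training
}
region_emb = [0.2, 0.8, 0.3]

probs = classify_open_vocab(region_emb, text_embs)
best = max(probs, key=probs.get)
print(best)
```

The same scoring can be applied to relationship names (for example, predicates describing how a subject and object interact) by swapping the category list, since nothing in the comparison depends on a fixed label set.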
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: ADaptive gradient method with the OPTimal convergence rate · Contrastive Language-Image Pre-training
