RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni, Mingqian Tang

TL;DR
RLIP is a contrastive pre-training strategy that uses entity and relation descriptions, paired with a new architecture and data generation techniques, to improve human-object interaction detection in zero-shot, few-shot, and fine-tuning settings.
Contribution
The paper proposes RLIP, a pre-training framework that combines a new architecture (ParSe), a data generation method, and noise mitigation mechanisms to improve HOI detection.
Findings
Enhanced zero-shot and few-shot HOI detection performance.
Improved robustness to noisy annotations.
Significant gains in fine-tuning accuracy.
Abstract
The task of Human-Object Interaction (HOI) detection targets fine-grained visual parsing of humans interacting with their environment, enabling a broad range of applications. Prior work has demonstrated the benefits of effective architecture design and integration of relevant cues for more accurate HOI detection. However, the design of an appropriate pre-training strategy for this task remains underexplored by existing approaches. To address this gap, we propose Relational Language-Image Pre-training (RLIP), a strategy for contrastive pre-training that leverages both entity and relation descriptions. To make effective use of such pre-training, we make three technical contributions: (1) a new Parallel entity detection and Sequential relation inference (ParSe) architecture that enables the use of both entity and relation descriptions during holistically optimized pre-training; (2) a…
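The idea of contrastive pre-training against entity and relation descriptions can be illustrated with a minimal sketch. It is not the RLIP implementation: the loss form (softmax cross-entropy over temperature-scaled cosine similarities), the function names `cosine` and `contrastive_loss`, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(visual, texts, target, temperature=0.1):
    """Cross-entropy over similarities between one visual feature
    and every candidate text embedding (hypothetical loss form)."""
    logits = [cosine(visual, t) / temperature for t in texts]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# Toy relation descriptions with made-up 2-D text embeddings.
relation_texts = {"ride": [1.0, 0.0], "hold": [0.0, 1.0]}
names = list(relation_texts)
embeddings = [relation_texts[n] for n in names]

# Made-up feature for one human-object pair, close to "ride".
pair_feature = [0.9, 0.1]

loss_ride = contrastive_loss(pair_feature, embeddings, names.index("ride"))
loss_hold = contrastive_loss(pair_feature, embeddings, names.index("hold"))
print(loss_ride < loss_hold)
```

The loss is lower when the pair feature is matched with the description it resembles, which is the signal a contrastive objective uses to pull visual and text representations together. The same pattern would apply to entity descriptions with entity features in place of pair features.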
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Visual Parsing
