RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao

TL;DR
RLIPv2 introduces a fast-converging relational pre-training model that scales effectively to large pseudo-labelled scene graph data, significantly improving relational reasoning in vision tasks with state-of-the-art results.
Contribution
The paper presents RLIPv2, a model that uses Asymmetric Language-Image Fusion (ALIF) to speed up training, together with a method for generating large-scale pseudo-labelled relational data. Combined, these make relational pre-training scalable and efficient.
Findings
Achieves state-of-the-art results on Human-Object Interaction Detection benchmarks.
Yields high performance with minimal data, e.g., 32.22 mAP with 1% of the data.
Demonstrates fast convergence and superior relational reasoning capabilities.
Abstract
Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with…
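To make the idea of gated cross-modal fusion more concrete, here is a minimal NumPy sketch. It is not the RLIPv2 implementation: the function names, the scalar tanh gates, and the single-head attention without learned projections are all assumptions chosen for illustration, and the asymmetry and the sparsified language encoding layers described in the abstract are not modelled. It only shows how a gate can control how much information from one modality flows into the other.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    # Single-head scaled dot-product attention, no learned projections
    # (a simplification made for this sketch).
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context


def gated_fusion(vision, language, gate_v, gate_l):
    # Hypothetical gating scheme: each modality receives a residual update
    # from the other, scaled by tanh(gate). A gate of 0 leaves the input
    # features unchanged.
    vision_out = vision + np.tanh(gate_v) * cross_attention(vision, language)
    language_out = language + np.tanh(gate_l) * cross_attention(language, vision)
    return vision_out, language_out


rng = np.random.default_rng(0)
vision = rng.standard_normal((5, 8))    # 5 visual tokens, feature dim 8
language = rng.standard_normal((3, 8))  # 3 text tokens, feature dim 8

v_closed, l_closed = gated_fusion(vision, language, 0.0, 0.0)  # gates closed
v_open, l_open = gated_fusion(vision, language, 1.0, 1.0)      # gates open
```

With both gates at zero the features pass through unchanged, so a fusion layer of this kind can be added to an existing encoder without disturbing it at initialization. Opening the gates mixes text information into the visual tokens and vice versa.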
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: ALIGN
