Introduction to the 1st Place Winning Model of OpenImages Relationship Detection Challenge
Ji Zhang, Kevin Shih, Andrew Tao, Bryan Catanzaro, Ahmed Elgammal

TL;DR
This paper presents a top-performing model for visual relationship detection that combines language bias, spatial features, and feature fusion techniques, achieving first place in the OpenImages Visual Relationship Detection Challenge on Kaggle.
Contribution
It combines a language-bias baseline, spatial features, and logit-level feature fusion, and shows in an ablation study that each component improves relationship detection performance to a non-trivial extent.
Findings
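The language-bias baseline is highly effective: it achieved 2nd place when submitted.
Spatial features are as important as visual features, especially for spatial relationships such as "under" and "inside of".
Fusing features by adding the output logits of separate modules before the final softmax improves overall model accuracy.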
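The language-bias baseline named in the first finding can be sketched as a frequency table. This is a minimal sketch, not the authors' implementation: it assumes the empirical distribution is the frequency of each predicate conditioned on the (subject class, object class) pair, counted on the training set and looked up unchanged at test time. The function names and the toy triples are hypothetical.

```python
from collections import Counter, defaultdict


def build_language_prior(triples):
    """Count predicates per (subject, object) class pair and normalize to frequencies."""
    counts = defaultdict(Counter)
    for subj, pred, obj in triples:
        counts[(subj, obj)][pred] += 1
    prior = {}
    for pair, pred_counts in counts.items():
        total = sum(pred_counts.values())
        prior[pair] = {p: c / total for p, c in pred_counts.items()}
    return prior


def predict_predicates(prior, subj, obj):
    """Rank predicates for a class pair by training frequency; unseen pairs give []."""
    dist = prior.get((subj, obj), {})
    return sorted(dist.items(), key=lambda kv: kv[1], reverse=True)


# Toy training annotations (hypothetical, for illustration only).
train_triples = [
    ("man", "holds", "guitar"),
    ("man", "holds", "guitar"),
    ("man", "plays", "guitar"),
    ("cat", "under", "table"),
]

prior = build_language_prior(train_triples)
ranked = predict_predicates(prior, "man", "guitar")
print(ranked)
```

A table like this uses no image content at all, which is why it serves as a pure language-bias baseline; how unseen class pairs should be handled is left open here.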
Abstract
This article describes the model we built that achieved 1st place in the OpenImages Visual Relationship Detection Challenge on Kaggle. Three key factors contribute the most to our success: 1) language bias is a powerful baseline for this task. We build the empirical distribution from the training set and use it directly at test time. This baseline achieved 2nd place when submitted; 2) spatial features are as important as visual features, especially for spatial relationships such as "under" and "inside of"; 3) an effective way to fuse different features is to first build a separate module for each of them, then add their output logits before the final softmax layer. We show in an ablation study that each factor improves performance to a non-trivial extent, and that the model performs best when all of them are combined.
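The fusion scheme in point 3 can be illustrated with a short sketch, under stated assumptions: each module emits one logit per predicate, the logits are summed element-wise, and a single softmax is applied to the sum. The three module names and all logit values below are hypothetical placeholders, not numbers from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(*module_logits):
    """Element-wise sum of the per-module logit vectors."""
    return [sum(vals) for vals in zip(*module_logits)]


predicates = ["under", "holds", "inside of"]

# Hypothetical outputs of three separate modules for one subject-object pair.
visual_logits = [0.2, 1.5, 0.1]
spatial_logits = [2.0, -0.5, 0.3]
language_logits = [0.5, 0.4, -1.0]

fused = fuse_logits(visual_logits, spatial_logits, language_logits)
probs = softmax(fused)  # one softmax, applied after the logits are added
best = predicates[probs.index(max(probs))]
print(best, probs)
```

Summing logits before the softmax is equivalent to multiplying the modules' unnormalized scores, so a module that strongly disfavors a predicate can pull its final probability down even when another module prefers it.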
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling
Methods: Softmax
