The Overlooked Classifier in Human-Object Interaction Recognition
Ying Jin, Yinpeng Chen, Lijuan Wang, Jianfeng Wang, Pei Yu, Lin Liang, Jenq-Neng Hwang, Zicheng Liu

TL;DR
This paper improves human-object interaction recognition by enhancing the classifier with semantic embeddings and a new loss, enabling detection-free classification and state-of-the-art performance without additional fine-tuning.
Contribution
It introduces a novel classifier enhancement using language embeddings and a new loss function to address class imbalance and multi-label challenges in HOI recognition.
Findings
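Initializing the classification head with language embeddings of HOIs boosts performance significantly, especially on the few-shot subset.
The detection-free classifier outperforms state-of-the-art methods that rely on object detection and human pose.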
Achieves state-of-the-art results in instance-level HOI detection without fine-tuning.
Abstract
Human-Object Interaction (HOI) recognition is challenging due to two factors: (1) significant imbalance across classes and (2) requiring multiple labels per image. This paper shows that these two challenges can be effectively addressed by improving the classifier with the backbone architecture untouched. Firstly, we encode the semantic correlation among classes into the classification head by initializing the weights with language embeddings of HOIs. As a result, the performance is boosted significantly, especially for the few-shot subset. Secondly, we propose a new loss named LSE-Sign to enhance multi-label learning on a long-tailed dataset. Our simple yet effective method enables detection-free HOI classification, outperforming state-of-the-art methods that require object detection and human pose by a clear margin. Moreover, we transfer the classification model to instance-level HOI…
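The first idea in the abstract, initializing the classification head with language embeddings of the HOI classes, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `init_classifier_from_text` and `classify` are hypothetical, the L2 normalization is an assumption, and the random `text_embeddings` array only stands in for real embeddings of HOI phrases produced by some language model.

```python
import numpy as np


def init_classifier_from_text(text_embeddings, normalize=True):
    """Build initial classifier weights from per-class text embeddings.

    text_embeddings: array of shape (num_classes, dim), one row per HOI class.
    Returns a weight matrix W of the same shape. Rows of semantically
    related classes start out close to each other.
    """
    W = np.asarray(text_embeddings, dtype=np.float64).copy()
    if normalize:
        # Assumption: rescale each class vector to unit length.
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W = W / np.clip(norms, 1e-12, None)
    return W


def classify(features, W):
    """Linear classification head: one logit per class for each image."""
    return features @ W.T


rng = np.random.default_rng(0)
num_classes, dim = 5, 8

# Stand-in for language embeddings of HOI phrases (hypothetical data).
text_embeddings = rng.normal(size=(num_classes, dim))
W = init_classifier_from_text(text_embeddings)

# Stand-in for backbone image features, projected to the same dimension.
features = rng.normal(size=(3, dim))
logits = classify(features, W)

# A feature aligned with one class embedding scores highest for that class.
aligned_logits = classify(W[2:3], W)
predicted_class = int(np.argmax(aligned_logits, axis=1)[0])
```

In this sketch the weights would then be trained as usual; the only change from a standard linear head is the starting point, which is what lets rare classes borrow structure from semantically similar frequent ones.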
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
