Focusing on what to decode and what to train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor
Junwen Chen, Yingcheng Wang, Keiji Yanai

TL;DR
This paper introduces SOV-STG-VLA, a framework for human-object interaction detection (HOID) that combines Subject-Object-Verb (SOV) decoding, Specific Target Guided (STG) denoising, and a Vision-Language Advisor (VLA), achieving state-of-the-art results with faster training.
Contribution
The paper proposes a disentangled SOV decoding approach, a specific target guided denoising strategy, and a vision-language advisor to enhance HOID performance and training efficiency.
Findings
Achieves state-of-the-art performance on HOID benchmarks.
Reduces the number of training epochs to one-sixth of that required by previous methods.
Demonstrates faster convergence and higher accuracy.
Abstract
Recent transformer-based methods achieve notable gains in the Human-Object Interaction Detection (HOID) task by leveraging the detection capability of DETR and the prior knowledge of Vision-Language Models (VLMs). However, these methods suffer from extended training times and complex optimization due to the entanglement of object detection and HOI recognition during the decoding process. In particular, the query embeddings used to predict both labels and boxes suffer from ambiguous representations, and the gap between the prediction of HOI labels and verb labels is not considered. To address these challenges, we introduce SOV-STG-VLA with three key components: Subject-Object-Verb (SOV) decoding, Specific Target Guided (STG) denoising, and a Vision-Language Advisor (VLA). Our SOV decoders disentangle object detection and verb recognition with a novel interaction region representation. The STG denoising…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
