Focusing on what to decode and what to train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor

Junwen Chen; Yingcheng Wang; Keiji Yanai

arXiv:2307.02291 · cs.CV · December 24, 2024 · 1 citation

TL;DR

This paper introduces SOV-STG-VLA, a human-object interaction detection (HOID) framework that combines Subject-Object-Verb (SOV) decoding, Specific Target Guided (STG) denoising, and a Vision-Language Advisor (VLA), achieving state-of-the-art results with faster training.

Contribution

The paper proposes a disentangled SOV decoding approach, a specific target guided denoising strategy, and a vision-language advisor to enhance HOID performance and training efficiency.

Findings

01. Achieves state-of-the-art performance on HOID benchmarks.

02. Reduces the number of training epochs to one-sixth of that needed by previous methods.

03. Demonstrates faster convergence and higher accuracy.

Abstract

Recent transformer-based methods achieve notable gains in the Human-object Interaction Detection (HOID) task by leveraging the detection of DETR and the prior knowledge of Vision-Language Model (VLM). However, these methods suffer from extended training times and complex optimization due to the entanglement of object detection and HOI recognition during the decoding process. In particular, the query embeddings used to predict both labels and boxes suffer from ambiguous representations, and the gap between the prediction of HOI labels and verb labels is not considered. To address these challenges, we introduce SOV-STG-VLA with three key components: Subject-Object-Verb (SOV) decoding, Specific Target Guided (STG) denoising, and a Vision-Language Advisor (VLA). Our SOV decoders disentangle object detection and verb recognition with a novel interaction region representation. The STG denoising…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling