Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection
Francesco Tonini, Lorenzo Vaquero, Alessandro Conti, Cigdem Beyan, Elisa Ricci

TL;DR
This paper introduces DYSCO, a training-free framework that leverages multimodal semantic representations and a novel attention mechanism to improve human-object interaction detection, especially for rare interactions.
Contribution
It proposes a new training-free HOI detection method that enhances semantic alignment using multimodal interaction signatures and a multi-head attention mechanism.
Findings
DYSCO outperforms existing training-free models in HOI detection.
It achieves competitive results with training-based approaches.
It is particularly effective at recognizing rare interactions.
Abstract
Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. We argue that recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has injected such potential and even proposed training-free methods, there remain key gaps. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics (DYSCO) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction…
