Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection
Quan Bi Pay, Vishnu Monn Baskaran, Junn Yong Loo, KokSheik Wong, Simon See

TL;DR
This paper introduces a novel wavelet attention backbone and ray-based encoder architecture to improve human-object interaction detection by enhancing feature aggregation and multi-scale attention while reducing computational costs.
Contribution
It proposes a wavelet attention-like backbone and a ray-based encoder architecture specifically designed for more efficient and accurate HOI detection.
Findings
Improved accuracy on HICO-DET dataset
Enhanced feature aggregation from diverse convolutional filters
Reduced computational overhead in HOI detection
Abstract
Human-object interaction (HOI) detection is essential for accurately localizing and characterizing interactions between humans and objects, providing a comprehensive understanding of complex visual scenes across various domains. However, existing HOI detectors often struggle to deliver reliable predictions efficiently, relying on resource-intensive training methods and inefficient architectures. To address these challenges, we conceptualize a wavelet attention-like backbone and a novel ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from diverse convolutional filters. Concurrently, the ray-based encoder facilitates multi-scale attention by optimizing the focus of the decoder on relevant regions of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsVideo Surveillance and Tracking Methods · Infrared Target Detection Methodologies · Visual Attention and Saliency Detection
