Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection

Quan Bi Pay; Vishnu Monn Baskaran; Junn Yong Loo; KokSheik Wong; Simon See

arXiv:2507.10977 · cs.CV · July 16, 2025

TL;DR

This paper introduces a novel wavelet attention backbone and ray-based encoder architecture to improve human-object interaction detection by enhancing feature aggregation and multi-scale attention while reducing computational costs.

Contribution

The paper proposes a wavelet attention-like backbone and a ray-based encoder architecture, both designed for more efficient and accurate HOI detection.

Findings

01. Improved accuracy on the HICO-DET dataset

02. Enhanced feature aggregation from diverse convolutional filters

03. Reduced computational overhead in HOI detection

Abstract

Human-object interaction (HOI) detection is essential for accurately localizing and characterizing interactions between humans and objects, providing a comprehensive understanding of complex visual scenes across various domains. However, existing HOI detectors often struggle to deliver reliable predictions efficiently, relying on resource-intensive training methods and inefficient architectures. To address these challenges, we conceptualize a wavelet attention-like backbone and a novel ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from diverse convolutional filters. Concurrently, the ray-based encoder facilitates multi-scale attention by optimizing the focus of the decoder on relevant regions of…
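
The abstract does not say how the wavelet attention-like backbone is built, so the following is only a generic illustration of what a wavelet decomposition does to a feature map, not the authors' method. It is a minimal sketch of a single-level 2D Haar transform in NumPy, which splits an input into one coarse (low-frequency) band and three detail (high-frequency) bands. The function name `haar_dwt2` and the toy input are assumptions introduced here for illustration.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform (illustrative helper).

    Expects a 2D array with even height and width and returns four
    half-resolution bands: one coarse band and three detail bands.
    """
    a = x[0::2, 0::2]  # top-left sample of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    coarse = (a + b + c + d) / 2.0    # local average (low frequency)
    row_diff = (a + b - c - d) / 2.0  # top row minus bottom row
    col_diff = (a - b + c - d) / 2.0  # left column minus right column
    diag_diff = (a - b - c + d) / 2.0 # diagonal difference
    return coarse, row_diff, col_diff, diag_diff

# Toy 4x4 "feature map" standing in for a real activation.
feature_map = np.arange(16, dtype=float).reshape(4, 4)
coarse, row_diff, col_diff, diag_diff = haar_dwt2(feature_map)
print(coarse.shape)  # each band has half the spatial resolution
```

Because this transform is orthonormal, the four bands together carry the same energy as the input, so nothing is discarded by the split itself. Any weighting or attention over the bands would be a separate design choice that the abstract does not specify.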

Taxonomy

Topics: Video Surveillance and Tracking Methods · Infrared Target Detection Methodologies · Visual Attention and Saliency Detection