A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains

Antonio Finocchiaro; Alessandro Sebastiano Catinello; Michele Mazzamuto; Rosario Leonardi; Antonino Furnari; Giovanni Maria Farinella

arXiv:2507.13326·cs.CV·December 8, 2025

TL;DR

This paper introduces a real-time egocentric hand-object interaction detection system for industrial settings, combining action recognition and object detection modules to improve speed and accuracy.

Contribution

It presents a cascaded architecture that pairs a Mamba-based action recognition model (EfficientNetV2 backbone) with a fine-tuned YOLOWorld object detector, reaching 38.52% p-AP on ENIGMA-51 at 30fps and 85.13% AP for hand and object detection.

Findings

01. Mamba model achieves 38.52% p-AP on ENIGMA-51

02. YOLOWorld reaches 85.13% AP for hand and object detection

03. System operates in real time at 30fps

Abstract

Hand-object interaction detection remains an open challenge in real-time applications, where intuitive user experiences depend on fast and accurate detection of interactions with surrounding objects. We propose an efficient approach for detecting hand-object interactions from streaming egocentric vision that operates in real time. Our approach consists of an action recognition module and an object detection module for identifying active objects upon confirmed interaction. Our Mamba model with EfficientNetV2 as backbone for action recognition achieves 38.52% p-AP on the ENIGMA-51 benchmark at 30fps, while our fine-tuned YOLOWorld reaches 85.13% AP for hand and object detection. We implement our models in a cascaded architecture where the action recognition and object detection modules operate sequentially. When the action recognition predicts a contact state, it activates the object detection…
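
The cascade described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `run_cascade`, `predict_contact`, `detect_objects`, the `Detection` and `FrameResult` types, and the 0.5 threshold are all hypothetical names and values chosen for illustration, and the two models are replaced by stub callables. The sketch also scores each frame independently, whereas the paper's action recognition model works on a video stream. It shows only the gating idea: the detector runs only on frames where contact is predicted.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, List, Tuple


@dataclass
class Detection:
    label: str                       # e.g. "hand" or an object class
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
    score: float                     # detector confidence


@dataclass
class FrameResult:
    index: int
    contact: bool
    detections: List[Detection]


def run_cascade(
    frames: Iterable[Any],
    predict_contact: Callable[[Any], float],           # hypothetical stage 1
    detect_objects: Callable[[Any], List[Detection]],  # hypothetical stage 2
    threshold: float = 0.5,                            # assumed value
) -> Iterator[FrameResult]:
    """Run stage 2 only on frames where stage 1 predicts contact."""
    for index, frame in enumerate(frames):
        contact = predict_contact(frame) >= threshold
        # The detector is skipped entirely when no contact is predicted.
        detections = detect_objects(frame) if contact else []
        yield FrameResult(index, contact, detections)


def demo() -> Tuple[List[FrameResult], List[Any]]:
    """Exercise the cascade with stub models in place of real networks."""
    frames = [0.1, 0.9, 0.7, 0.2]  # stand-ins for images
    detector_calls: List[Any] = []

    def predict_contact(frame: Any) -> float:
        return float(frame)  # stub: the "frame" is its own contact score

    def detect_objects(frame: Any) -> List[Detection]:
        detector_calls.append(frame)  # record which frames reach stage 2
        return [Detection("hand", (0, 0, 10, 10), 0.9)]

    results = list(run_cascade(frames, predict_contact, detect_objects))
    return results, detector_calls
```

In this sketch the detector cost is paid only on contact frames, which is one plausible reason a sequential design helps a system stay within a real-time budget; the source does not spell out the gating logic beyond the abstract.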

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition