Pointing-Based Object Recognition
Lukáš Hajdúch, Viktor Kocur

TL;DR
This paper introduces a modular RGB-based pipeline that recognizes the object a person is pointing at, integrating object detection, body pose estimation, monocular depth estimation, and vision-language models to improve accuracy in complex scenes.
Contribution
The paper presents a novel RGB-only system for identifying the target of a human pointing gesture, built by combining existing state-of-the-art methods for object detection, body pose estimation, monocular depth estimation, and vision-language modeling.
Findings
Depth information reconstructed from a single RGB image significantly improves target identification, especially in complex scenes with overlapping objects.
Image captioning helps correct classification errors.
The modular design enables deployment without specialized depth sensors.
Abstract
This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are…
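To illustrate why reconstructed depth can help in scenes with overlapping objects, here is a minimal sketch of one plausible way to select a pointing target in 3D. It is not the authors' implementation: the elbow-to-wrist ray, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the `Detection` type, and all numeric values are assumptions made for the example. In a real pipeline the keypoints would come from a pose estimator, the boxes from an object detector, and the depths from a monocular depth model.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detector output: label, box center (pixels), depth (meters)."""
    label: str
    u: float
    v: float
    depth_m: float


def backproject(u, v, depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0):
    """Pinhole back-projection of a pixel with depth into camera coordinates.

    The intrinsics are placeholder values, not taken from the paper.
    """
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def distance_to_ray(point, origin, direction):
    """Perpendicular distance from a 3D point to a ray; inf if behind the origin."""
    norm = math.sqrt(sum(c * c for c in direction))
    unit = tuple(c / norm for c in direction)
    rel = tuple(p - o for p, o in zip(point, origin))
    t = sum(r * c for r, c in zip(rel, unit))  # projection length along the ray
    if t <= 0:
        return math.inf
    closest = tuple(o + t * c for o, c in zip(origin, unit))
    return math.dist(point, closest)


def pick_target(elbow, wrist, detections):
    """Return the detection whose 3D center lies closest to the elbow->wrist ray.

    `elbow` and `wrist` are (u, v, depth) triples; the ray starts at the wrist.
    """
    elbow_3d = backproject(*elbow)
    wrist_3d = backproject(*wrist)
    direction = tuple(w - e for w, e in zip(wrist_3d, elbow_3d))
    return min(
        detections,
        key=lambda d: distance_to_ray(
            backproject(d.u, d.v, d.depth_m), wrist_3d, direction
        ),
    )


# Two objects that nearly coincide in the image but sit at different depths.
detections = [
    Detection("cup", 348.0, 250.0, 1.7),
    Detection("book", 350.0, 252.0, 3.0),
]
target = pick_target(elbow=(200.0, 300.0, 1.0), wrist=(260.0, 280.0, 1.2),
                     detections=detections)
print(target.label)
```

In this toy scene the two box centers are only a few pixels apart, so a purely 2D pointing line could hardly separate them, while the 3D ray passes much closer to one object than the other.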
Taxonomy
Topics: Hand Gesture Recognition Systems · Multimodal Machine Learning Applications · Human Pose and Action Recognition
