Pointing-Based Object Recognition

Lukáš Hajdúch; Viktor Kocur

arXiv:2603.15403·cs.CV·March 17, 2026

TL;DR

This paper introduces a modular, RGB-only pipeline for recognizing the object a person is pointing at. It integrates object detection, body pose estimation, monocular depth estimation, and vision-language models to improve accuracy in complex scenes.

Contribution

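The paper presents an RGB-only recognition system that combines several existing state-of-the-art methods (object detection, body pose estimation, monocular depth estimation, and vision-language models) to identify the object targeted by a human pointing gesture.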

Findings

01. Depth information significantly improves recognition accuracy.

02. Image captioning helps correct classification errors.

03. Modular approach enables deployment without specialized sensors.

Abstract

This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are…
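To illustrate why reconstructed 3D information can help separate overlapping objects, the following is a minimal sketch of one possible target-selection step. It is not the authors' implementation: the pinhole back-projection, the elbow-to-wrist pointing ray, the nearest-to-ray scoring rule, and all function names are assumptions made for illustration. The inputs stand in for the outputs of a pose estimator, an object detector, and a monocular depth model.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with a depth value into camera coordinates.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); these intrinsics are hypothetical inputs.
    """
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def point_to_ray_distance(point, origin, direction):
    """Distance from a 3D point to the ray origin + t * direction, t >= 0."""
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("pointing direction must be non-zero")
    unit = [c / norm for c in direction]
    offset = [p - o for p, o in zip(point, origin)]
    # Clamp t at 0 so objects behind the hand are measured from the hand itself.
    t = max(0.0, sum(a * b for a, b in zip(offset, unit)))
    closest = [o + t * c for o, c in zip(origin, unit)]
    return math.dist(point, closest)


def select_pointed_object(elbow, wrist, objects):
    """Return the name of the object whose 3D center lies closest to the ray.

    The ray starts at the wrist and follows the elbow-to-wrist direction
    (an assumed pointing model). `objects` maps names to 3D centers.
    """
    direction = tuple(w - e for w, e in zip(wrist, elbow))
    return min(
        objects,
        key=lambda name: point_to_ray_distance(objects[name], wrist, direction),
    )


# Hypothetical scene: two objects share the same image column but differ in depth.
objects = {"cup": (1.0, 0.0, 1.0), "book": (1.0, 0.0, 3.0)}
target = select_pointed_object(
    elbow=(0.0, 0.0, 1.0), wrist=(0.2, 0.0, 1.0), objects=objects
)
```

In this toy scene the two object centers differ only in depth, so a purely 2D comparison in the image plane could not tell them apart, while the 3D ray distance can.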

Taxonomy

Topics: Hand Gesture Recognition Systems · Multimodal Machine Learning Applications · Human Pose and Action Recognition