TL;DR
TAIHRI is a novel vision-language model designed for precise 3D localization of task-relevant human body parts in close-range human-robot interaction, enabling more natural and safe robot responses.
Contribution
It introduces the first VLM tailored for HRI perception that localizes critical body parts in 3D space using 2D reasoning and adapts to downstream tasks.
Findings
Achieves superior accuracy in localizing task-critical body parts.
Adapts to downstream tasks such as following natural language commands and global space recovery.
Demonstrates effectiveness on egocentric interaction benchmarks.
Abstract
Accurate 3D human keypoint localization is a critical technology enabling robots to achieve natural and safe physical interaction with users. Conventional 3D human keypoint estimation methods primarily focus on whole-body reconstruction quality relative to the root joint. However, in practical human-robot interaction (HRI) scenarios, robots are more concerned with the precise metric-scale spatial localization of task-relevant body parts in the egocentric camera's 3D coordinate frame. We propose TAIHRI, the first Vision-Language Model (VLM) tailored for close-range HRI perception, capable of understanding users' motion commands and directing the robot's attention to the most task-relevant keypoints. By quantizing 3D keypoints into a finite interaction space, TAIHRI precisely localizes the 3D spatial coordinates of critical body parts by 2D keypoint reasoning via next token prediction,…
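The abstract's idea of quantizing 3D keypoints into a finite interaction space can be illustrated with a minimal sketch. This is not the authors' implementation: the workspace bounds, the bin count, and the function names `quantize` and `dequantize` are all hypothetical, chosen only to show how a metric-scale camera-frame point could map to a small set of discrete indices that a language model might emit as tokens.

```python
from typing import Sequence, Tuple

# Hypothetical interaction-space bounds in metres (camera frame: x, y, z).
# The paper's actual bounds and resolution are not given in this excerpt.
BOUNDS = ((-1.0, 1.0), (-1.0, 1.0), (0.0, 2.0))
NUM_BINS = 1000  # hypothetical number of bins per axis


def quantize(point: Sequence[float],
             bounds=BOUNDS,
             num_bins: int = NUM_BINS) -> Tuple[int, ...]:
    """Map a 3D point to one integer bin index per axis."""
    tokens = []
    for value, (lo, hi) in zip(point, bounds):
        # Clamp so points outside the workspace land in the edge bins.
        clamped = min(max(value, lo), hi)
        idx = int((clamped - lo) / (hi - lo) * num_bins)
        # The upper bound itself would yield num_bins, so cap it.
        tokens.append(min(idx, num_bins - 1))
    return tuple(tokens)


def dequantize(tokens: Sequence[int],
               bounds=BOUNDS,
               num_bins: int = NUM_BINS) -> Tuple[float, ...]:
    """Recover the centre of each bin as an approximate metric coordinate."""
    return tuple(
        lo + (t + 0.5) * (hi - lo) / num_bins
        for t, (lo, hi) in zip(tokens, bounds)
    )


wrist = (0.123, -0.456, 1.234)  # made-up keypoint position in metres
tokens = quantize(wrist)
approx = dequantize(tokens)
```

Under these assumed settings each bin is 2 mm wide, so a round trip through `quantize` and `dequantize` changes each coordinate by at most 1 mm for points inside the bounds. A finer or coarser grid trades vocabulary size against localization resolution.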
