TRec: Learning Hand-Object Interactions through 2D Point Track Motion
Dennis Holzmann, Sven Wachsmuth

TL;DR
This paper introduces TRec, a novel method for hand-object action recognition that uses 2D point tracks as motion cues, improving accuracy without relying on explicit hand or object detection.
Contribution
The work demonstrates that tracking randomly sampled points across frames with CoTracker and using these trajectories in a Transformer model enhances hand-object interaction recognition.
Findings
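Adding 2D point tracks as a motion cue substantially improves recognition accuracy.
The method achieves notable gains even when given only the initial frame and the point tracks, without the full video sequence.
The approach stays lightweight: it needs no explicit detection of hands, objects, or interaction regions.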
Abstract
We present a novel approach for hand-object action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and the point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks…
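The abstract's pipeline (sample random points, track them across frames, hand the trajectories to a Transformer) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `sample_query_points` and `tracks_to_tokens`, the per-point token layout (normalized positions followed by frame-to-frame displacements), and the synthetic `fake_tracks` that stand in for CoTracker output are all assumptions made for illustration.

```python
import random


def sample_query_points(num_points, width, height, seed=0):
    """Draw random (x, y) pixel positions in the first frame (hypothetical helper)."""
    rng = random.Random(seed)
    return [(rng.uniform(0, width - 1), rng.uniform(0, height - 1))
            for _ in range(num_points)]


def tracks_to_tokens(tracks, width, height):
    """Turn tracks[t][n] = (x, y) into one feature vector per tracked point.

    Assumed encoding, not taken from the paper: normalized x positions,
    normalized y positions, then per-frame x and y displacements.
    """
    num_frames = len(tracks)
    num_points = len(tracks[0])
    tokens = []
    for n in range(num_points):
        xs = [tracks[t][n][0] / width for t in range(num_frames)]
        ys = [tracks[t][n][1] / height for t in range(num_frames)]
        dxs = [xs[t + 1] - xs[t] for t in range(num_frames - 1)]
        dys = [ys[t + 1] - ys[t] for t in range(num_frames - 1)]
        tokens.append(xs + ys + dxs + dys)
    return tokens


# Synthetic stand-in for a point tracker's output:
# every sampled point drifts one pixel to the right per frame.
points = sample_query_points(num_points=4, width=64, height=48)
fake_tracks = [[(x + t, y) for (x, y) in points] for t in range(3)]
tokens = tracks_to_tokens(fake_tracks, width=64, height=48)
```

In a real setup the `fake_tracks` list would be replaced by trajectories from an actual point tracker, and each token would be projected to the model dimension before entering the Transformer alongside the image-frame features.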
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Robot Manipulation and Learning
