VideoPose: Estimating 6D object pose from videos
Apoorva Beedu, Zhile Ren, Varun Agrawal, Irfan Essa

TL;DR
VideoPose introduces a CNN-based method that uses temporal information from videos to estimate 6D object poses efficiently and accurately, suitable for real-time robotic and AR applications.
Contribution
The paper presents a novel approach combining CNNs and recurrent networks to estimate 6D object pose directly from videos, improving efficiency and robustness over existing methods.
Findings
Achieves accuracy on par with state-of-the-art methods on the YCB-Video dataset.
Operates at 30 fps, enabling real-time applications.
Outperforms previous methods in speed while maintaining accuracy.
Abstract
We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos. Our approach leverages the temporal information from a video sequence, and is computationally efficient and robust to support robotic and AR domains. Our proposed network takes a pre-trained 2D object detector as input, and aggregates visual features through a recurrent neural network to make predictions at each frame. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with the state-of-the-art algorithms. Further, with a speed of 30 fps, it is also more efficient than the state-of-the-art, and therefore applicable to a variety of applications that require real-time object pose estimation.
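The per-frame recurrent aggregation described in the abstract can be illustrated with a toy sketch. This is not the authors' network: the class `RecurrentPoseSketch`, the function `estimate_sequence`, the random weights, the plain tanh recurrence, and the choice to represent a 6D pose as a 3D translation plus a unit quaternion are all assumptions made for illustration. The visual features that a detector and CNN backbone would produce are replaced here by random vectors.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class RecurrentPoseSketch:
    """Hypothetical toy model: per-frame features in, one pose per frame out."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.w_in = rand_matrix(hidden_dim, feat_dim)     # feature -> hidden
        self.w_rec = rand_matrix(hidden_dim, hidden_dim)  # hidden -> hidden
        self.w_out = rand_matrix(7, hidden_dim)           # hidden -> pose (assumed 3 + 4)

    def step(self, feature, hidden):
        """Fold one frame's features into the hidden state and predict a pose."""
        pre = [a + b for a, b in zip(matvec(self.w_in, feature),
                                     matvec(self.w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        out = matvec(self.w_out, hidden)
        translation, quat = out[:3], out[3:]
        # Normalize so the rotation part is a valid unit quaternion.
        norm = math.sqrt(sum(c * c for c in quat))
        if norm < 1e-12:
            quat = [1.0, 0.0, 0.0, 0.0]  # fall back to the identity rotation
        else:
            quat = [c / norm for c in quat]
        return (translation, quat), hidden


def estimate_sequence(model, frame_features):
    """Run the model over a video, carrying the hidden state across frames."""
    hidden = [0.0] * model.hidden_dim
    poses = []
    for feature in frame_features:
        pose, hidden = model.step(feature, hidden)
        poses.append(pose)
    return poses


# Stand-in for detector/CNN features: 5 frames of 8-dimensional random vectors.
rng = random.Random(1)
frames = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(5)]
poses = estimate_sequence(RecurrentPoseSketch(feat_dim=8, hidden_dim=16), frames)
```

Because the hidden state is carried from frame to frame, the prediction for a given frame depends on the frames before it, which is the sense in which such a model uses temporal information. A trained system would learn the weights and use a gated recurrent cell or similar rather than this bare tanh update.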
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Robotics and Sensor-Based Localization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
