VideoPose: Estimating 6D object pose from videos

Apoorva Beedu; Zhile Ren; Varun Agrawal; Irfan Essa

arXiv:2111.10677·cs.CV·November 23, 2021

TL;DR

VideoPose introduces a CNN-based method that uses temporal information from videos to estimate 6D object poses efficiently and accurately, suitable for real-time robotic and AR applications.

Contribution

The paper presents a novel approach combining CNNs and recurrent networks to estimate 6D object pose directly from videos, improving efficiency and robustness over existing methods.

Findings

01

Achieves accuracy on par with state-of-the-art methods on the YCB-Video dataset.

02

Operates at 30 fps, enabling real-time applications.

03

Runs faster than state-of-the-art methods while maintaining comparable accuracy.
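
To put the 30 fps figure in concrete terms, the per-frame time budget can be worked out directly. This is a minimal sketch: the function names and the example stage timings are hypothetical and are not measurements from the paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def meets_real_time(stage_times_ms, fps: float = 30.0) -> bool:
    """Check whether the summed per-frame stage times fit within the frame budget."""
    return sum(stage_times_ms) <= frame_budget_ms(fps)


budget = frame_budget_ms(30.0)
print(f"{budget:.1f} ms per frame")  # 33.3 ms per frame

# Hypothetical stage timings (detector, pose network), for illustration only.
print(meets_real_time([10.0, 15.0]))  # True: 25 ms fits in the budget
print(meets_real_time([20.0, 20.0]))  # False: 40 ms exceeds the budget
```

At 30 fps, detection and pose prediction together have roughly 33 ms per frame.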

Abstract

We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos. Our approach leverages the temporal information from a video sequence, and is computationally efficient and robust to support robotic and AR domains. Our proposed network takes the output of a pre-trained 2D object detector as input, and aggregates visual features through a recurrent neural network to make predictions at each frame. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with the state-of-the-art algorithms. Further, with a speed of 30 fps, it is also more efficient than the state-of-the-art, and therefore applicable to a variety of applications that require real-time object pose estimation.
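
The idea of aggregating per-frame features through a recurrent network can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: the class name `RecurrentPoseSketch`, the plain tanh recurrent cell, the random untrained weights, the feature sizes, and the quaternion-plus-translation output are hypothetical and are not taken from the paper or its repository.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class RecurrentPoseSketch:
    """Toy recurrent aggregator: one 6D pose estimate per frame (hypothetical)."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.W_x = random_matrix(hidden_dim, feat_dim, rng)    # input -> hidden
        self.W_h = random_matrix(hidden_dim, hidden_dim, rng)  # hidden -> hidden
        self.W_out = random_matrix(7, hidden_dim, rng)         # hidden -> pose

    def run(self, frame_features):
        """Return a (quaternion, translation) pair for every frame."""
        h = [0.0] * self.hidden_dim  # hidden state carries temporal context
        poses = []
        for x in frame_features:
            pre = [a + b for a, b in zip(matvec(self.W_x, x), matvec(self.W_h, h))]
            h = [math.tanh(p) for p in pre]
            out = matvec(self.W_out, h)
            quat, trans = out[:4], out[4:]
            quat[0] += 1.0  # offset toward the identity rotation (keeps norm > 0 here)
            norm = math.sqrt(sum(q * q for q in quat))
            poses.append(([q / norm for q in quat], trans))
        return poses


# Stand-in features for a 5-frame clip; in practice these would come from
# image regions proposed by the 2D object detector.
rng = random.Random(1)
clip = [[rng.random() for _ in range(4)] for _ in range(5)]
model = RecurrentPoseSketch(feat_dim=4, hidden_dim=8)
poses = model.run(clip)
print(len(poses))  # 5
```

Because the hidden state is carried from one frame to the next, the estimate for a given frame depends on the frames that came before it, which is the sense in which temporal information is used.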

Code & Models

Repositories

apoorvabeedu/videopose (PyTorch, official)

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Robotics and Sensor-Based Localization

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings