Video based Object 6D Pose Estimation using Transformers
Apoorva Beedu, Huda Alamri, Irfan Essa

TL;DR
This paper presents VideoPose, a Transformer-based framework for 6D object pose estimation in videos that leverages temporal information for accurate, efficient, and real-time pose refinement, outperforming CNN-based methods.
Contribution
The paper introduces a novel end-to-end Transformer architecture that effectively captures long-range dependencies in video sequences for 6D pose estimation.
Findings
Performs on par with state-of-the-art Transformer methods on the YCB-Video dataset
Runs at 33 fps, fast enough for real-time applications
Outperforms CNN-based approaches in accuracy and efficiency
Abstract
We introduce VideoPose, a Transformer-based framework for 6D object pose estimation, comprising an end-to-end attention-based modelling architecture that attends to previous frames in order to estimate accurate 6D object poses in videos. Our approach leverages the temporal information in a video sequence for pose refinement, while remaining computationally efficient and robust. Compared to existing methods, our architecture captures and reasons over long-range dependencies efficiently, iteratively refining its estimates over video sequences. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with state-of-the-art Transformer methods and performs significantly better than CNN-based approaches. Further, at a speed of 33 fps, it is also more efficient and therefore applicable to a variety of applications that require real-time object pose…
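To make the idea of "attending to previous frames" concrete, here is a minimal sketch of causal scaled dot-product attention over a sequence of per-frame feature vectors. It is an illustration under stated assumptions, not the VideoPose architecture: the function name `causal_temporal_attention`, the single-head design, the random projection matrices, and the feature sizes are all hypothetical choices made for this example.

```python
import numpy as np


def causal_temporal_attention(frame_feats, w_q, w_k, w_v):
    """Single-head attention in which frame t attends only to frames 0..t.

    frame_feats: (T, d_in) array, one feature vector per video frame.
    w_q, w_k, w_v: (d_in, d_k) projection matrices (hypothetical, untrained).
    Returns the attended features (T, d_k) and the attention weights (T, T).
    """
    q = frame_feats @ w_q
    k = frame_feats @ w_k
    v = frame_feats @ w_v

    num_frames = frame_feats.shape[0]
    d_k = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d_k)  # (T, T) frame-to-frame similarity

    # Mask out future frames so each frame only sees itself and earlier frames.
    future = np.triu(np.ones((num_frames, num_frames), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the (unmasked) past frames.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights


# Example with assumed sizes: 5 frames, 8-dim features, 4-dim attention space.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
attended, weights = causal_temporal_attention(feats, w_q, w_k, w_v)
print(attended.shape)  # (5, 4)
```

In this sketch the first frame can only attend to itself, while later frames mix information from every earlier frame, which is one simple way to model refinement over a video sequence. A pose regression head on top of the attended features would be a further assumption and is omitted here.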
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Advanced Vision and Imaging
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization
