Video based Object 6D Pose Estimation using Transformers

Apoorva Beedu; Huda Alamri; Irfan Essa

arXiv:2210.13540 · cs.CV · September 6, 2023 · 6 cites

TL;DR

This paper presents VideoPose, a Transformer-based framework for 6D object pose estimation in videos. It uses temporal information from previous frames to refine poses, runs in real time at 33 fps, and outperforms CNN-based methods.

Contribution

The paper introduces a novel end-to-end Transformer architecture that effectively captures long-range dependencies in video sequences for 6D pose estimation.

Findings

01

Performs on par with state-of-the-art Transformer methods on the YCB-Video dataset

02

Operates at 33 fps for real-time applications

03

Outperforms CNN-based approaches in accuracy and efficiency

Abstract

We introduce VideoPose, a Transformer-based 6D object pose estimation framework comprising an end-to-end attention-based modelling architecture that attends to previous frames in order to estimate accurate 6D object poses in videos. Our approach leverages the temporal information from a video sequence for pose refinement while remaining computationally efficient and robust. Compared to existing methods, our architecture captures and reasons over long-range dependencies efficiently, iteratively refining poses over video sequences. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with state-of-the-art Transformer methods and performs significantly better than CNN-based approaches. Further, at 33 fps, it is also more efficient and therefore applicable to a variety of applications that require real-time object pose…
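
The idea of attending to previous frames can be illustrated with a minimal sketch of causal scaled dot-product attention over per-frame feature vectors. This is not the VideoPose architecture: it assumes a single head, uses the raw features as queries, keys, and values with no learned projections, and the names `causal_frame_attention` and `clip` are hypothetical. It only shows how each frame's representation can be built from itself and the frames before it.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_frame_attention(frame_features):
    """Single-head scaled dot-product attention over a frame sequence.

    Assumption: queries, keys, and values are the raw feature vectors
    (no learned projections), which keeps the sketch dependency-free.
    Frame t may only attend to frames 0..t, i.e. itself and earlier frames.
    """
    dim = len(frame_features[0])
    num_frames = len(frame_features)
    attended, weights = [], []
    for t, query in enumerate(frame_features):
        history = frame_features[: t + 1]  # current frame plus earlier ones
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in history
        ]
        w = softmax(scores)
        # Weighted sum of the visible frames' features.
        attended.append(
            [sum(w[j] * history[j][i] for j in range(t + 1)) for i in range(dim)]
        )
        # Pad with zeros so every row has one weight per frame in the clip.
        weights.append(w + [0.0] * (num_frames - t - 1))
    return attended, weights


# Hypothetical 2-D features for a three-frame clip.
clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = causal_frame_attention(clip)
for t, row in enumerate(weights):
    print(f"frame {t} attention weights: {[round(x, 3) for x in row]}")
```

The first frame can only attend to itself, so its output equals its input, while later frames blend in information from earlier ones. A real model would typically add learned projections, multiple heads, and a head that regresses rotation and translation from the attended features; those parts are omitted here.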

Code & Models

Repositories

apoorvabeedu/videopose
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Advanced Vision and Imaging

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization