YOLOPose V2: Understanding and Improving Transformer-based 6D Pose Estimation
Arul Selvam Periyasamy, Arash Amini, Vladimir Tsaturyan, and Sven Behnke

TL;DR
This paper introduces YOLOPose V2, a Transformer-based approach for real-time 6D object pose estimation that directly regresses keypoints and predicts orientations, achieving accuracy competitive with state-of-the-art methods.
Contribution
The work presents a novel Transformer-based architecture for joint object detection and 6D pose estimation, with direct keypoint regression and an orientation module, improving efficiency and accuracy.
Findings
Achieves real-time 6D pose estimation with competitive accuracy.
Object queries specialize in detecting objects in specific regions.
Models trained on smaller datasets can still produce effective pose estimates.
Abstract
6D object pose estimation is a crucial prerequisite for autonomous robot manipulation applications. The state-of-the-art models for pose estimation are convolutional neural network (CNN)-based. Lately, Transformers, an architecture originally proposed for natural language processing, have been achieving state-of-the-art results in many computer vision tasks as well. Equipped with the multi-head self-attention mechanism, Transformers enable simple single-stage end-to-end architectures for learning object detection and 6D object pose estimation jointly. In this work, we propose YOLOPose (short for You Only Look Once Pose estimation), a Transformer-based multi-object 6D pose estimation method based on keypoint regression, as well as an improved variant of the YOLOPose model. In contrast to the standard heatmaps for predicting keypoints in an image, we directly regress the keypoints. Additionally, we…
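The difference between heatmap-based keypoint prediction and direct keypoint regression can be illustrated with a minimal Python sketch. It is not the YOLOPose code: the function names, the list-of-lists heatmap layout, and the assumption that the regression head outputs (x, y) pairs normalized to [0, 1] are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def keypoints_from_heatmaps(heatmaps: List[List[List[float]]]) -> List[Point]:
    """Heatmap-style decoding (illustrative): one H x W score map per keypoint.

    Each keypoint is read off as the (x, y) grid cell with the highest score,
    so precision is limited to the heatmap resolution.
    """
    points: List[Point] = []
    for hm in heatmaps:
        best_val, best_xy = float("-inf"), (0.0, 0.0)
        for y, row in enumerate(hm):
            for x, value in enumerate(row):
                if value > best_val:
                    best_val, best_xy = value, (float(x), float(y))
        points.append(best_xy)
    return points


def keypoints_from_regression(outputs: List[float], width: int, height: int) -> List[Point]:
    """Direct regression (illustrative): the head emits 2K numbers per object.

    Assumption: values are (x, y) pairs normalized to [0, 1]; they are scaled
    to pixel coordinates without any argmax or grid quantization.
    """
    if len(outputs) % 2 != 0:
        raise ValueError("expected an even number of outputs (x, y pairs)")
    return [
        (outputs[i] * width, outputs[i + 1] * height)
        for i in range(0, len(outputs), 2)
    ]


if __name__ == "__main__":
    # One 3 x 3 heatmap whose peak is at the center cell.
    print(keypoints_from_heatmaps([[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]))
    # Two regressed keypoints for a 640 x 480 image.
    print(keypoints_from_regression([0.25, 0.5, 1.0, 0.0], 640, 480))
```

The heatmap path needs a dense H x W map per keypoint and returns grid-cell coordinates, whereas the regression path produces continuous coordinates from a compact output vector, which fits naturally with a per-object-query prediction head.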
Taxonomy
Topics: Robot Manipulation and Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
