T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression
Arash Amini, Arul Selvam Periyasamy, and Sven Behnke

TL;DR
T6D-Direct is a transformer-based, real-time, single-stage method for multi-object 6D pose estimation. On the YCB-Video dataset it reports the fastest inference time, with accuracy comparable to state-of-the-art methods.
Contribution
It adapts the DETR transformer architecture to regress 6D poses directly, enabling single-stage multi-object pose estimation without standard detection components such as region of interest pooling, non-maximum suppression, and bounding box proposals.
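To make concrete what "direct 6D pose regression" produces, the sketch below treats one predicted pose as a rotation plus a translation and applies it to a point on the object model. The unit-quaternion parametrization and the names `quat_to_rotmat` and `apply_pose` are assumptions chosen for illustration; the summary above does not state which rotation representation T6D-Direct regresses.

```python
import math


def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The quaternion is normalized first, so a slightly non-unit network
    output would still yield a valid rotation. (Assumed parametrization.)
    """
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def apply_pose(rotation, translation, point):
    """Map a 3D point from the object frame to the camera frame: R @ p + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical pose: 90 degrees about the z-axis, 0.5 units along z.
q = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
t = (0.0, 0.0, 0.5)
transformed = apply_pose(quat_to_rotmat(q), t, (1.0, 0.0, 0.0))
```

Three translation values and three rotational degrees of freedom give the six degrees of freedom that "6D" refers to; a direct method outputs these per object rather than recovering them from intermediate keypoints or correspondences.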
Findings
Achieves real-time inference speed.
Provides pose estimation accuracy comparable to state-of-the-art.
Demonstrates effectiveness on the YCB-Video dataset.
Abstract
6D pose estimation is the task of predicting the translation and orientation of objects in a given input image, which is a crucial prerequisite for many robotics and augmented reality applications. Lately, the Transformer Network architecture, equipped with a multi-head self-attention mechanism, is emerging to achieve state-of-the-art results in many computer vision tasks. DETR, a Transformer-based model, formulated object detection as a set prediction problem and achieved impressive results without standard components like region of interest pooling, non-maximal suppression, and bounding box proposals. In this work, we propose T6D-Direct, a real-time single-stage direct method with a transformer-based architecture built on DETR to perform 6D multi-object pose direct estimation. We evaluate the performance of our method on the YCB-Video dataset. Our method achieves the fastest inference…
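The set prediction formulation mentioned in the abstract can be illustrated with a small matching example: the model emits a fixed-size set of predictions, and each ground-truth object is paired with exactly one prediction so that the total matching cost is minimal. DETR solves this assignment with the Hungarian algorithm; the sketch below swaps in a brute-force search over permutations, which is only practical for tiny sets. The function name `match_predictions` and the cost values are hypothetical.

```python
from itertools import permutations


def match_predictions(cost):
    """Find the minimum-cost one-to-one assignment of predictions to targets.

    cost[i][j] is the cost of pairing prediction i with ground-truth object j.
    Assumes there are at least as many predictions as ground-truth objects.
    Brute force is used here for clarity; it is not what DETR uses.
    """
    n_pred = len(cost)
    n_gt = len(cost[0]) if cost else 0
    best_total, best_perm = float("inf"), ()
    # perm[j] is the prediction index assigned to ground-truth object j.
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[perm[j]][j] for j in range(n_gt))
        if total < best_total:
            best_total, best_perm = total, perm
    pairs = [(pred, gt) for gt, pred in enumerate(best_perm)]
    return pairs, best_total


# Hypothetical costs: 3 predictions (rows) against 2 ground-truth objects (columns).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
pairs, total = match_predictions(cost)
matched = {pred for pred, _ in pairs}
unmatched = [i for i in range(len(cost)) if i not in matched]
```

Here prediction 1 is paired with object 0 and prediction 0 with object 1, while prediction 2 stays unmatched. In a DETR-style setup, unmatched predictions are trained toward a "no object" class, which is why no non-maximum suppression step is needed to remove duplicates.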
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Label Smoothing · Residual Connection · Layer Normalization · Position-Wise Feed-Forward Layer · Convolution
