CRT-6D: Fast 6D Object Pose Estimation with Cascaded Refinement Transformers
Pedro Castro, Tae-Kyun Kim

TL;DR
CRT-6D is a fast, transformer-based 6D object pose estimation method that samples sparse keypoint features and refines poses iteratively, reaching state-of-the-art accuracy at roughly twice the speed of the closest real-time methods.
Contribution
The paper proposes CRT-6D, a novel cascaded transformer approach using sparse surface keypoints for efficient and accurate 6D pose estimation, outperforming existing real-time methods.
Findings
Inference runs 2x faster than the closest real-time state-of-the-art methods.
Supports up to 21 objects with a single model.
Achieves state-of-the-art accuracy on the LM-O and YCB-V datasets.
Abstract
Learning-based 6D object pose estimation methods rely on computing large intermediate pose representations and/or iteratively refining an initial estimate with a slow render-and-compare pipeline. This paper introduces a novel method we call Cascaded Pose Refinement Transformers, or CRT-6D. We replace the commonly used dense intermediate representation with a sparse set of features sampled from the feature pyramid, which we call OSKFs (Object Surface Keypoint Features), where each element corresponds to an object keypoint. We employ lightweight deformable transformers and chain them together to iteratively refine proposed poses over the sampled OSKFs. We achieve inference runtimes 2x faster than the closest real-time state-of-the-art methods while supporting up to 21 objects on a single model. We demonstrate the effectiveness of CRT-6D by performing extensive experiments on the LM-O and YCBV…
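To make the idea of sampling sparse keypoint features concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a pinhole camera with intrinsics K, a pose given as a rotation R and translation t, and a single feature map at a known stride. The function names project_keypoints and sample_features are hypothetical, and the pixel-to-feature mapping (a plain division by the stride) is a simplifying assumption.

```python
import numpy as np

def project_keypoints(keypoints, R, t, K):
    """Project (N, 3) object-frame keypoints to pixels under the pose (R, t)."""
    cam = keypoints @ R.T + t          # (N, 3) points in the camera frame
    uvw = cam @ K.T                    # (N, 3) homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # (N, 2) pixel coordinates (u, v)

def sample_features(feature_map, uv, stride):
    """Bilinearly sample a (C, H, W) feature map at (N, 2) pixel locations."""
    _, H, W = feature_map.shape
    # Map pixels to feature-map coordinates (assumed: divide by the stride)
    # and clamp so that keypoints projected off-image stay in bounds.
    x = np.clip(uv[:, 0] / stride, 0, W - 1)
    y = np.clip(uv[:, 1] / stride, 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = x - x0                        # horizontal interpolation weights
    wy = y - y0                        # vertical interpolation weights
    top = feature_map[:, y0, x0] * (1 - wx) + feature_map[:, y0, x1] * wx
    bottom = feature_map[:, y1, x0] * (1 - wx) + feature_map[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T   # (N, C): one vector per keypoint
```

In a cascaded setup of the kind the abstract describes, each refinement stage would re-project the keypoints under the latest pose estimate and sample again, so the cost per stage scales with the number of keypoints rather than with the image resolution.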
Videos
CRT6D: Fast 6D Object Pose Estimation with Cascaded Refinement Transformers (YouTube)
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications
