6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning
Lu Zou, Zhangjin Huang, Naijie Gu, Guoping Wang

TL;DR
This paper introduces 6D-ViT, a transformer-based network that leverages multi-source instance representations from RGB images, point clouds, and shape priors to achieve highly accurate category-level 6D object pose estimation.
Contribution
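The paper proposes a two-stream transformer framework with two branches: Pixelformer, which extracts pixelwise appearance representations from RGB images, and Pointformer, which extracts pointwise geometric features from point clouds. Together with categorical shape priors, these yield multi-source instance representations.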
Findings
Reports state-of-the-art results on both synthetic and real datasets.
Significantly outperforms existing methods in category-level 6D object pose estimation.
Demonstrates robustness across diverse scenarios.
Abstract
This paper presents 6D-ViT, a transformer-based instance representation learning network, which is suitable for highly accurate category-level object pose estimation on RGB-D images. Specifically, a novel two-stream encoder-decoder framework is dedicated to exploring complex and powerful instance representations from RGB images, point clouds and categorical shape priors. For this purpose, the whole framework consists of two main branches, named Pixelformer and Pointformer. The Pixelformer contains a pyramid transformer encoder with an all-MLP decoder to extract pixelwise appearance representations from RGB images, while the Pointformer relies on a cascaded transformer encoder and an all-MLP decoder to acquire the pointwise geometric characteristics from point clouds. Then, dense instance representations (i.e., correspondence matrix, deformation field) are obtained from a multi-source…
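The abstract is cut off before it says how the correspondence matrix and deformation field are used. As a rough illustration only, the sketch below follows a convention common in shape-prior-based category-level pose methods: the deformation field is added to the categorical prior, the row-normalized correspondence matrix maps the deformed prior to canonical coordinates for each observed point, and a similarity transform is then fitted with the Umeyama algorithm. That convention, the function names `nocs_from_prior` and `umeyama`, and all array shapes are assumptions for illustration, not details confirmed by the text above.

```python
import numpy as np

def nocs_from_prior(prior, deformation, corr_logits):
    """Hypothetical helper: canonical coordinates from a deformed shape prior.

    prior:        (Nc, 3) categorical shape prior points
    deformation:  (Nc, 3) per-point offsets (the "deformation field")
    corr_logits:  (Nv, Nc) unnormalized correspondence scores
    returns:      (Nv, 3) canonical coordinates, one per observed point
    """
    deformed = prior + deformation
    # Row-wise softmax so each observed point is a convex mix of prior points.
    z = corr_logits - corr_logits.max(axis=1, keepdims=True)
    a = np.exp(z)
    a /= a.sum(axis=1, keepdims=True)
    return a @ deformed

def umeyama(src, dst):
    """Least-squares similarity transform such that dst ~= s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    u, sig, vt = np.linalg.svd(cov)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.ones(3)
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        d[-1] = -1.0
    r = u @ np.diag(d) @ vt
    var_s = (xs ** 2).sum() / len(src)
    s = (sig * d).sum() / var_s
    t = mu_d - s * r @ mu_s
    return s, r, t
```

In this sketch, `umeyama(nocs, observed_points)` would return the scale, rotation, and translation that place the canonical coordinates onto the observed point cloud; a real pipeline would typically add outlier handling, which is omitted here.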
