6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning

Lu Zou; Zhangjin Huang; Naijie Gu; Guoping Wang

arXiv:2110.04792·cs.CV·November 7, 2022

TL;DR

This paper introduces 6D-ViT, a transformer-based network that leverages multi-source instance representations from RGB images, point clouds, and shape priors to achieve highly accurate category-level 6D object pose estimation.

Contribution

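The paper proposes a novel two-stream transformer framework for learning comprehensive instance representations from multiple data sources. Its two branches are Pixelformer, which extracts pixelwise appearance features from RGB images, and Pointformer, which extracts pointwise geometric features from point clouds.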

Findings

01. Achieves state-of-the-art results on synthetic and real datasets.

02. Significantly outperforms existing methods in 6D pose estimation.

03. Demonstrates robustness across diverse scenarios.

Abstract

This paper presents 6D-ViT, a transformer-based instance representation learning network, which is suitable for highly accurate category-level object pose estimation on RGB-D images. Specifically, a novel two-stream encoder-decoder framework is dedicated to exploring complex and powerful instance representations from RGB images, point clouds and categorical shape priors. For this purpose, the whole framework consists of two main branches, named Pixelformer and Pointformer. The Pixelformer contains a pyramid transformer encoder with an all-MLP decoder to extract pixelwise appearance representations from RGB images, while the Pointformer relies on a cascaded transformer encoder and an all-MLP decoder to acquire the pointwise geometric characteristics from point clouds. Then, dense instance representations (i.e., correspondence matrix, deformation field) are obtained from a multi-source…
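
The abstract names two dense instance representations, a correspondence matrix and a deformation field, but is cut off before saying how they are used. The following is a minimal sketch, assuming the formulation common in shape-prior methods: the deformation field is added to the categorical shape prior, and a row-normalized correspondence matrix then maps each observed point to a soft combination of the deformed prior points. The function names, array shapes, and random inputs are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax_rows(logits):
    """Normalize each row so that it sums to 1 (numerically stable softmax)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def apply_dense_representations(prior, deformation, correspondence):
    """Map observed points to canonical coordinates (assumed formulation).

    prior:          (M, 3) categorical shape prior points
    deformation:    (M, 3) per-point offsets applied to the prior
    correspondence: (N, M) soft assignment of N observed points to prior points
    returns:        (N, 3) canonical coordinates, one per observed point
    """
    deformed_prior = prior + deformation      # instance-specific shape
    return correspondence @ deformed_prior    # soft lookup per observed point


# Hypothetical stand-ins for network outputs: 8 prior points, 5 observed points.
rng = np.random.default_rng(0)
prior = rng.uniform(-0.5, 0.5, size=(8, 3))
deformation = 0.01 * rng.standard_normal((8, 3))
correspondence = softmax_rows(rng.standard_normal((5, 8)))

coords = apply_dense_representations(prior, deformation, correspondence)
print(coords.shape)
```

Under this assumption, the resulting canonical coordinates would be aligned with the observed point cloud to recover the pose; that alignment step is outside the scope of this sketch.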
