Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

Jinghuan Shang; Srijan Das; Michael S. Ryoo

arXiv:2206.11895 · cs.CV · January 16, 2023 · 5 cites

TL;DR

This paper introduces a 3D Token Representation Layer (3DTRL) that estimates 3D positional information of visual tokens to improve viewpoint-agnostic visual understanding in Transformer models, enhancing performance across multiple vision tasks.

Contribution

The paper proposes a novel 3DTRL module that estimates 3D token positions using unsupervised learning, enabling Transformers to learn viewpoint-invariant representations from 2D images.

Findings

1. 3DTRL improves accuracy in image classification, multi-view video alignment, and action recognition.

2. Models with 3DTRL outperform baseline Transformers with minimal additional computation.

3. The approach recovers 3D positional information from 2D patches in an unsupervised manner.

Abstract

Humans are remarkably flexible in understanding viewpoint changes due to visual cortex supporting the perception of 3D structure. In contrast, most of the computer vision models that learn visual representation from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, the vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, these Transformers do not perform explicit operations to learn viewpoint-agnostic representation for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric…
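
To make the geometric idea concrete, the following is a minimal sketch of how a per-token pseudo-depth and a camera matrix could place 2D patch tokens in 3D space. It is not the authors' implementation: the function names, the normalized patch-center coordinates, the pinhole-style back-projection, and the rotation/translation convention are all assumptions made for illustration. In the actual model the depths and camera parameters would be predicted by learned networks, whereas here they are passed in as fixed numbers.

```python
def backproject_tokens(depths, grid_h, grid_w):
    """Lift each patch token to a camera-space 3D point (hypothetical helper).

    depths: one pseudo-depth value per token, in row-major patch order.
    Patch centers are mapped to normalized image coordinates in [-1, 1].
    """
    points = []
    for row in range(grid_h):
        for col in range(grid_w):
            u = (col + 0.5) / grid_w * 2.0 - 1.0  # horizontal patch center
            v = (row + 0.5) / grid_h * 2.0 - 1.0  # vertical patch center
            d = depths[row * grid_w + col]
            # Assumed pinhole-style back-projection with unit focal length.
            points.append((u * d, v * d, d))
    return points


def camera_to_world(points, rotation, translation):
    """Map camera-space points to a shared world frame (hypothetical helper).

    Assumed convention: p_world = R^T (p_cam - t), with R a 3x3 rotation
    matrix given as nested lists and t a length-3 translation.
    """
    world = []
    for p in points:
        shifted = [p[m] - translation[m] for m in range(3)]
        world.append(tuple(
            sum(rotation[m][k] * shifted[m] for m in range(3))
            for k in range(3)
        ))
    return world


# Usage: a 2x2 grid of tokens with made-up depths and an identity camera.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
cam_points = backproject_tokens([1.0, 1.0, 2.0, 2.0], 2, 2)
world_points = camera_to_world(cam_points, identity, (0.0, 0.0, 0.0))
```

Expressing token positions in a single world frame is what would let views taken from different cameras agree on where a token lies, which is the intuition behind viewpoint-agnostic representations described in the abstract.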

Code & Models

Repositories

elicassion/3dtrl (PyTorch, official)

Videos

Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space (SlidesLive)

Taxonomy

Topics: Advanced Vision and Imaging · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Label Smoothing · Residual Connection · Byte Pair Encoding · Adam · Layer Normalization · Position-Wise Feed-Forward Layer