Recasting Generic Pretrained Vision Transformers As Object-Centric Scene Encoders For Manipulation Policies

Jianing Qian; Anastasios Panagopoulos; Dinesh Jayaraman

arXiv:2405.15916·cs.CV·May 28, 2024

Open Access

TL;DR

This paper introduces SOFT, a wrapper that turns pre-trained vision transformers (PVTs) into object-centric scene encoders for robotic manipulation, improving policy performance without any further training of the vision transformer.

Contribution

SOFT is a wrapper around pre-trained vision transformers that individuates and locates object-like entities from their attentions and describes them with their activations, making these models more useful for robotic manipulation tasks.

Findings

01. Policies trained on SOFT representations outperform those trained on standard PVT representations in manipulation tasks.

02. Policies trained on SOFT(PVT) approach the performance of state-of-the-art robotics-aware representations.

03. The method works across various pre-trained vision transformers without additional training.

Abstract

Generic re-usable pre-trained image representation encoders have become a standard component of methods for many computer vision tasks. As visual representations for robots, however, their utility has been limited, leading to a recent wave of efforts to pre-train robotics-specific image encoders that are better suited to robotic tasks than their generic counterparts. We propose Scene Objects From Transformers, abbreviated as SOFT, a wrapper around pre-trained vision transformer (PVT) models that bridges this gap without any further training. Rather than construct representations out of only the final-layer activations, SOFT individuates and locates object-like entities from PVT attentions, and describes them with PVT activations, producing an object-centric embedding. Across standard choices of generic pre-trained vision transformers (PVTs), we demonstrate in each case that policies trained…
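The abstract's two steps (individuate and locate entities from attention, then describe them with activations) can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not say how entities are individuated, so plain k-means stands in for that step here, and the function name `soft_like_slots`, its parameters, and the random inputs are all hypothetical.

```python
import numpy as np

def soft_like_slots(attn_feats, activations, grid_hw, num_slots=4, iters=10, seed=0):
    """Group patches by attention-derived features, then describe each group.

    attn_feats:  (N, Da) per-patch features derived from attention (assumed input)
    activations: (N, Dv) per-patch activations from the same model (assumed input)
    grid_hw:     (H, W) patch grid with H * W == N
    Returns (slots, assign): slots is (num_slots, 2 + Dv) holding
    [mean row, mean col, mean activation]; assign is the (N,) patch-to-slot map.
    """
    n = attn_feats.shape[0]
    h, w = grid_hw
    assert h * w == n, "patch grid must match the number of patches"
    rng = np.random.default_rng(seed)

    # Placeholder grouping step: k-means on attention features (an assumption,
    # chosen only for illustration; the paper's procedure may differ).
    centers = attn_feats[rng.choice(n, size=num_slots, replace=False)].astype(float)
    for _ in range(iters):
        dists = ((attn_feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for k in range(num_slots):
            members = attn_feats[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)

    # Normalised (row, col) position of every patch, each in [0, 1].
    rows, cols = np.divmod(np.arange(n), w)
    coords = np.stack([rows / max(h - 1, 1), cols / max(w - 1, 1)], axis=1)

    # Describe each group: where it is (mean position) and what it is (mean activation).
    slots = np.zeros((num_slots, 2 + activations.shape[1]))
    for k in range(num_slots):
        mask = assign == k
        if mask.any():
            slots[k, :2] = coords[mask].mean(axis=0)
            slots[k, 2:] = activations[mask].mean(axis=0)
    return slots, assign

# Usage with random stand-in data for a 4x4 patch grid.
rng = np.random.default_rng(0)
attn = rng.normal(size=(16, 8))
acts = rng.normal(size=(16, 32))
slots, assign = soft_like_slots(attn, acts, grid_hw=(4, 4), num_slots=3)
print(slots.shape)  # (3, 34)
```

The resulting array has one row per entity, combining a location and a descriptor, which is the general shape of an object-centric embedding a downstream policy could consume. In a real setting the two inputs would come from a pre-trained vision transformer rather than random numbers.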

Taxonomy

Topics: Image Processing and 3D Reconstruction

Methods: Attention Is All You Need · Dense Connections · Softmax · Layer Normalization · Spatial-Reduction Attention · Linear Layer · Multi-Head Attention · Residual Connection · Absolute Position Encodings · Vision Transformer