Recasting Generic Pretrained Vision Transformers As Object-Centric Scene Encoders For Manipulation Policies
Jianing Qian, Anastasios Panagopoulos, Dinesh Jayaraman

TL;DR
This paper introduces SOFT (Scene Objects From Transformers), a wrapper that turns pre-trained vision transformers into object-centric scene encoders for robotic manipulation, improving policy performance without any further training of the encoder.
Contribution
SOFT is a novel wrapper that individuates and describes object-like entities from pre-trained vision transformers, enhancing their utility for robotic manipulation tasks.
Findings
Policies trained on SOFT representations outperform those trained on standard pre-trained vision transformer (PVT) representations in manipulation tasks.
Policies trained on SOFT(PVT) approach the performance of state-of-the-art robotics-aware representations.
The method works across various pre-trained vision transformers without further training of the encoder.
Abstract
Generic re-usable pre-trained image representation encoders have become a standard component of methods for many computer vision tasks. As visual representations for robots, however, their utility has been limited, leading to a recent wave of efforts to pre-train robotics-specific image encoders that are better suited to robotic tasks than their generic counterparts. We propose Scene Objects From Transformers, abbreviated as SOFT, a wrapper around pre-trained vision transformer (PVT) models that bridges this gap without any further training. Rather than construct representations out of only the final layer activations, SOFT individuates and locates object-like entities from PVT attentions, and describes them with PVT activations, producing an object-centric embedding. Across standard choices of generic pre-trained vision transformers (PVTs), we demonstrate in each case that policies trained…
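The abstract describes the pipeline only at a high level: group patches into object-like entities using attention, then describe each entity with activations. The sketch below illustrates that general idea with NumPy and is not the authors' algorithm. The k-means grouping of attention rows, the mean pooling, the normalized centroid as the "location", and the names `kmeans` and `object_centric_embedding` are all assumptions chosen for illustration.

```python
import numpy as np


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization.

    This is a stand-in grouping step (an assumption), not the paper's method.
    """
    points = np.asarray(points, dtype=float)
    centers = [points[0]]
    for _ in range(1, k):
        # Pick the point farthest from every center chosen so far.
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def object_centric_embedding(attn, acts, grid_hw, num_objects):
    """Build one (descriptor, location) row per object-like group of patches.

    attn: (N, N) patch-to-patch attention, e.g. averaged over heads (assumed).
    acts: (N, D) patch activations from some transformer layer (assumed).
    grid_hw: (H, W) patch grid with H * W == N.
    """
    h, w = grid_hw
    n = h * w
    assert attn.shape == (n, n) and acts.shape[0] == n
    # Patches with similar attention patterns are grouped together.
    labels = kmeans(attn, num_objects)
    ys, xs = np.divmod(np.arange(n), w)
    coords = np.stack([ys / max(h - 1, 1), xs / max(w - 1, 1)], axis=1)
    slots = []
    for j in range(num_objects):
        mask = labels == j
        if not mask.any():
            # Empty group: emit a zero slot so the output shape stays fixed.
            slots.append(np.zeros(acts.shape[1] + 2))
            continue
        descriptor = acts[mask].mean(axis=0)   # what the entity looks like
        location = coords[mask].mean(axis=0)   # where it is, in [0, 1]^2
        slots.append(np.concatenate([descriptor, location]))
    return np.stack(slots), labels


# Toy input: a 4x4 patch grid whose left and right halves attend only
# within themselves, with a distinct activation vector per half.
h, w = 4, 4
group = (np.arange(h * w) % w >= 2).astype(int)
same = (group[:, None] == group[None, :]).astype(float)
attn = same / same.sum(axis=1, keepdims=True)
acts = np.eye(2)[group]
emb, labels = object_centric_embedding(attn, acts, (h, w), num_objects=2)
print(emb.shape)  # (2, 4): two slots, each a 2-d descriptor plus a 2-d location
```

In a real setting, `attn` and `acts` would come from a pre-trained vision transformer, and the flattened slot matrix would be the input to a manipulation policy. Which layers, heads, and grouping rule to use are the substance of the paper and are not specified in this excerpt.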
Taxonomy
Topics: Image Processing and 3D Reconstruction
Methods: Attention Is All You Need · Dense Connections · Softmax · Layer Normalization · Spatial-Reduction Attention · Linear Layer · Multi-Head Attention · Residual Connection · Absolute Position Encodings · Vision Transformer
