Adapting Pretrained ViTs with Convolution Injector for Visuo-Motor Control

Dongyoon Hwang; Byungkun Lee; Hojoon Lee; Hyunseung Kim; Jaegul Choo

arXiv:2406.06072·cs.CV·June 11, 2024

TL;DR

This paper introduces Convolution Injector (CoIn), a module that enhances pretrained Vision Transformers with convolutional biases, significantly improving their performance in visuo-motor control tasks across multiple models and domains.

Contribution

The paper proposes CoIn, a novel add-on that injects convolutional biases into pretrained ViTs, enabling better adaptation for control tasks by incorporating locality and equivariance.

Findings

01 CoIn improves performance across all tested environments.

02 Pretrained ViTs with CoIn outperform baseline models.

03 The method is effective across different ViT architectures and control domains.

Abstract

Vision Transformers (ViTs), when paired with large-scale pretraining, have shown remarkable performance across various computer vision tasks, primarily due to their weak inductive bias. However, while this weak inductive bias aids pretraining scalability, it may hinder the effective adaptation of ViTs to visuo-motor control tasks because control-centric inductive biases are absent. These absent biases include spatial locality and translation equivariance, which convolutions naturally offer. To this end, we introduce Convolution Injector (CoIn), an add-on module that injects convolutions, which are rich in locality and equivariance biases, into a pretrained ViT for effective adaptation in visuo-motor control. We evaluate CoIn with three distinct types of pretrained ViTs (CLIP, MVP, VC-1) across 12 varied control tasks within three separate domains (Adroit,…
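
The two biases the abstract attributes to convolutions, spatial locality and translation equivariance, can be checked on a toy example. The sketch below is not CoIn and is not taken from the paper's code; it is a minimal 1D illustration in plain Python, and the helper names (`circular_conv1d`, `shift`), the smoothing kernel, and the wrap-around (circular) padding are assumptions made purely for the demonstration. With circular padding, equivariance holds exactly; with zero padding it would hold only away from the borders.

```python
def circular_conv1d(signal, kernel):
    """Slide `kernel` over `signal`, wrapping around at the edges (hypothetical helper)."""
    n, k = len(signal), len(kernel)
    half = k // 2
    return [
        sum(kernel[j] * signal[(i + j - half) % n] for j in range(k))
        for i in range(n)
    ]


def shift(signal, steps):
    """Circularly shift `signal` to the right by `steps` positions (hypothetical helper)."""
    n = len(signal)
    return [signal[(i - steps) % n] for i in range(n)]


signal = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 5.0, 1.0]
kernel = [0.25, 0.5, 0.25]  # illustrative smoothing filter, receptive field of 3

# Translation equivariance: shifting the input and then filtering gives the
# same result as filtering and then shifting the output.
shift_then_conv = circular_conv1d(shift(signal, 2), kernel)
conv_then_shift = shift(circular_conv1d(signal, kernel), 2)
equivariant = shift_then_conv == conv_then_shift

# Locality: perturbing one input position changes only the outputs whose
# receptive field covers that position.
perturbed = list(signal)
perturbed[4] += 10.0
before = circular_conv1d(signal, kernel)
after = circular_conv1d(perturbed, kernel)
changed_positions = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]

print("equivariant:", equivariant)
print("outputs affected by perturbing index 4:", changed_positions)
```

Global self-attention has neither property built in: every output token can depend on every input token, and learned position embeddings tie features to absolute locations. That contrast is the gap the abstract says an injected convolution is meant to fill.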

Code & Models

Repositories

dojeon-ai/CoIn (PyTorch, official)

Taxonomy

Topics: Advanced Vision and Imaging

Methods: Convolution