Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers

Miguel Saavedra-Ruiz; Sacha Morin; Liam Paull

arXiv:2203.03682 · cs.RO · May 3, 2022 · 1 citation

Open Access

TL;DR

This paper demonstrates how a self-supervised pretrained Vision Transformer can be adapted for monocular robot navigation, enabling effective perception and control with minimal annotated data in real-time scenarios.

Contribution

It introduces a method to adapt pretrained ViTs for coarse image segmentation in robot navigation using few labeled images, achieving real-time performance on CPU.

Findings

01 Effective coarse segmentation with 70 images

02 Lightweight architectures enable real-time inference on CPU

03 Successful deployment on a mobile robot for lane following and obstacle avoidance

Abstract

In this work, we consider the problem of learning a perception model for monocular robot navigation using few annotated images. Using a Vision Transformer (ViT) pretrained with a label-free self-supervised method, we successfully train a coarse image segmentation model for the Duckietown environment using 70 training images. Our model performs coarse image segmentation at the 8x8 patch level, and the inference resolution can be adjusted to balance prediction granularity and real-time perception constraints. We study how best to adapt a ViT to our task and environment, and find that some lightweight architectures can yield good single-image segmentation at a usable frame rate, even on CPU. The resulting perception model is used as the backbone for a simple yet robust visual servoing agent, which we deploy on a differential drive mobile robot to perform two tasks: lane following and…
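The trade-off between inference resolution and prediction granularity at the 8x8 patch level can be illustrated with a small sketch. This is not the authors' code: the function names `patch_grid` and `upsample_patch_labels`, the example resolutions, and the use of plain nested lists are all assumptions made for illustration. It only shows how many patch-level predictions a given input size yields and how a per-patch label map could be expanded back to pixel resolution.

```python
PATCH = 8  # patch size stated in the abstract


def patch_grid(height, width, patch=PATCH):
    """Return (rows, cols) of the patch grid, ignoring any remainder pixels."""
    return height // patch, width // patch


def upsample_patch_labels(labels, patch=PATCH):
    """Expand a 2D list of per-patch class ids into a pixel-level mask.

    Each patch label is repeated `patch` times along both axes
    (nearest-neighbour upsampling).
    """
    mask = []
    for row in labels:
        pixel_row = [label for label in row for _ in range(patch)]
        # Copy the row so that each pixel row is an independent list.
        mask.extend(list(pixel_row) for _ in range(patch))
    return mask


if __name__ == "__main__":
    # Hypothetical inference resolutions, chosen only for illustration.
    for h, w in [(240, 320), (480, 640)]:
        rows, cols = patch_grid(h, w)
        print(f"{h}x{w} input: {rows}x{cols} grid, {rows * cols} patch predictions")
```

Halving each image dimension cuts the number of patch predictions to a quarter, which is the lever the abstract refers to when it mentions balancing granularity against real-time constraints.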

Taxonomy

Topics: Image Processing Techniques and Applications · Advanced Vision and Imaging · Robotics and Sensor-Based Localization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Dropout · Softmax