Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers
Miguel Saavedra-Ruiz, Sacha Morin, Liam Paull

TL;DR
This paper demonstrates how a self-supervised pretrained Vision Transformer can be adapted for monocular robot navigation, enabling effective perception and control with minimal annotated data in real-time scenarios.
Contribution
It introduces a method to adapt pretrained ViTs for coarse image segmentation in robot navigation using few labeled images, achieving real-time performance on CPU.
Findings
Effective coarse segmentation with 70 images
Lightweight architectures enable real-time inference on CPU
Successful deployment on a mobile robot for lane following and obstacle avoidance
Abstract
In this work, we consider the problem of learning a perception model for monocular robot navigation using few annotated images. Using a Vision Transformer (ViT) pretrained with a label-free self-supervised method, we successfully train a coarse image segmentation model for the Duckietown environment using 70 training images. Our model performs coarse image segmentation at the 8x8 patch level, and the inference resolution can be adjusted to balance prediction granularity and real-time perception constraints. We study how best to adapt a ViT to our task and environment, and find that some lightweight architectures can yield good single-image segmentation at a usable frame rate, even on CPU. The resulting perception model is used as the backbone for a simple yet robust visual servoing agent, which we deploy on a differential drive mobile robot to perform two tasks: lane following and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsImage Processing Techniques and Applications · Advanced Vision and Imaging · Robotics and Sensor-Based Localization
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Dropout · Softmax
