Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance

Anish Bhattacharya; Nishanth Rao; Dhruv Parikh; Pratik Kunapuli; Yuwei Wu; Yuezhan Tao; Nikolai Matni; Vijay Kumar

arXiv:2405.10391·cs.RO·April 3, 2025·2 cites

Open Access

TL;DR

This paper explores vision transformers for end-to-end, high-speed obstacle avoidance in quadrotors and reports that they outperform convolutional and U-Net architectures, with evaluation in simulation and on hardware.

Contribution

It introduces the first application of vision transformers for end-to-end quadrotor control, showing improved accuracy and generalization at high speeds compared to other neural network architectures.

Findings

01. Vision transformers outperform CNNs and U-Nets at high speeds.
02. Recurrent models further enhance performance and reduce energy consumption.
03. Successful real-world implementation up to 7 m/s.

Abstract

We demonstrate the capabilities of an attention-based end-to-end approach for high-speed vision-based quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art learning architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional model-based approaches to navigation via independent perception, mapping, planning, and control modules break down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end vision-to-control networks have been shown to have great potential for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer (ViT) models for depth image-to-control in high-fidelity…
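To make the "depth image-to-control" idea concrete, here is a minimal NumPy sketch of an attention-based mapping from a depth image to a command vector. It is not the authors' model: the patch size, embedding width, single attention head, mean pooling, random weights, and the 3-element output are all assumptions chosen for illustration, and the helper names (`patchify`, `self_attention`, `depth_to_command`, `init_params`) are hypothetical.

```python
import numpy as np


def patchify(depth, patch):
    """Split an (H, W) depth image into flattened non-overlapping patches."""
    h, w = depth.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    blocks = depth.reshape(rows, patch, cols, patch).swapaxes(1, 2)
    return blocks.reshape(rows * cols, patch * patch)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over patch tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return weights @ v, weights


def init_params(num_patches, patch_dim, dim, out_dim, seed=0):
    """Random (untrained) weights; all sizes here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    scale = 0.02
    return {
        "embed": rng.normal(0, scale, (patch_dim, dim)),  # linear patch embedding
        "pos": rng.normal(0, scale, (num_patches, dim)),  # learned-style position term
        "wq": rng.normal(0, scale, (dim, dim)),
        "wk": rng.normal(0, scale, (dim, dim)),
        "wv": rng.normal(0, scale, (dim, dim)),
        "head": rng.normal(0, scale, (dim, out_dim)),     # linear command head
    }


def depth_to_command(depth, params, patch=8):
    """Map a depth image to a command vector with one attention layer."""
    tokens = patchify(depth, patch) @ params["embed"] + params["pos"]
    attended, weights = self_attention(
        tokens, params["wq"], params["wk"], params["wv"]
    )
    tokens = tokens + attended      # residual connection
    pooled = tokens.mean(axis=0)    # average over all patch tokens
    return pooled @ params["head"], weights


# Hypothetical 32x48 depth image with values in [0, 1).
depth = np.random.default_rng(1).random((32, 48))
params = init_params(num_patches=(32 // 8) * (48 // 8), patch_dim=64, dim=16, out_dim=3)
command, attn = depth_to_command(depth, params)
```

Each row of `attn` shows how strongly one image patch attends to every other patch, which is the mechanism that lets an attention-based model relate distant regions of the depth image in a single layer. A trained network would learn these weights from flight data rather than sampling them at random.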

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Robotic Path Planning Algorithms · Robotics and Sensor-Based Localization

Methods: Attention Is All You Need · Linear Layer · Convolution · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Concatenated Skip Connection · Max Pooling · Multi-Head Attention · Layer Normalization