Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance
Anish Bhattacharya, Nishanth Rao, Dhruv Parikh, Pratik Kunapuli, Yuwei Wu, Yuezhan Tao, Nikolai Matni, Vijay Kumar

TL;DR
This paper explores the use of vision transformers for end-to-end high-speed obstacle avoidance in quadrotors, demonstrating superior performance over traditional architectures in simulation and hardware tests.
Contribution
The paper introduces the first application of vision transformers to end-to-end quadrotor control, showing improved accuracy and generalization at high speeds compared to other neural network architectures.
Findings
Vision transformers outperform CNNs and U-Nets at high speeds.
Recurrent models further enhance performance and reduce energy consumption.
Successful real-world implementation up to 7 m/s.
Abstract
We demonstrate the capabilities of an attention-based end-to-end approach for high-speed vision-based quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art learning architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional model-based approaches to navigation via independent perception, mapping, planning, and control modules break down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end vision-to-control networks have been shown to have great potential for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer (ViT) models for depth image-to-control in high-fidelity…
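As a rough illustration of what a ViT-style depth image-to-control mapping involves, the following is a minimal NumPy sketch, not the authors' network. The 60×90 image size, 10-pixel patches, 32-dimensional tokens, single attention head, three-element command output, and random untrained weights are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def patchify(depth, patch):
    """Split an (H, W) depth image into flattened, non-overlapping patches."""
    h, w = depth.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = depth.reshape(h // patch, patch, w // patch, patch)
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-5):
    """Layer normalization without learned scale or shift (a simplification)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over patch tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights


def init_params(rng, n_patches, patch, dim=32, n_out=3):
    """Random, untrained weights; all sizes here are illustrative assumptions."""
    s = 0.02
    return {
        "embed": rng.normal(0.0, s, (patch * patch, dim)),
        "pos": rng.normal(0.0, s, (n_patches, dim)),
        "wq": rng.normal(0.0, s, (dim, dim)),
        "wk": rng.normal(0.0, s, (dim, dim)),
        "wv": rng.normal(0.0, s, (dim, dim)),
        "head": rng.normal(0.0, s, (dim, n_out)),
    }


def depth_to_command(depth, params, patch=10):
    """Map one depth image to a small command vector via one attention block."""
    # Linear patch embedding plus a position embedding per patch.
    tokens = patchify(depth, patch) @ params["embed"] + params["pos"]
    attended, _ = self_attention(
        layer_norm(tokens), params["wq"], params["wk"], params["wv"]
    )
    tokens = tokens + attended  # residual connection
    pooled = layer_norm(tokens).mean(axis=0)  # average over all patches
    return pooled @ params["head"]


rng = np.random.default_rng(0)
depth = rng.uniform(0.0, 1.0, size=(60, 90))  # stand-in for a depth frame
params = init_params(rng, n_patches=54, patch=10)  # (60/10) * (90/10) = 54
command = depth_to_command(depth, params)
print(command.shape)  # (3,)
```

A trained model would learn these weights from data and would typically stack several multi-head attention blocks. The sketch only shows how image patches become tokens that attend to one another before being reduced to a command.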
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Robotic Path Planning Algorithms · Robotics and Sensor-Based Localization
Methods: Attention Is All You Need · Linear Layer · Convolution · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Concatenated Skip Connection · Max Pooling · Multi-Head Attention · Layer Normalization
