Training Strategies for Vision Transformers for Object Detection
Apoorv Singh

TL;DR
This paper evaluates inference-time optimization strategies for vision-transformer-based object detection in autonomous driving. It reports a 63% inference-time improvement at the cost of a 3% accuracy drop, which makes the approach practical for real-time deployment on edge devices.
Contribution
The paper introduces inference-time optimization strategies for vision transformers that balance accuracy against speed, and demonstrates their effectiveness in real-world autonomous driving scenarios.
Findings
Inference time improved by 63% at the cost of only a 3% drop in accuracy.
Transformer inference time was reduced below that of traditional CNN detectors.
The strategies were validated at float32 and float16 precision using TensorRT.
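To give a feel for what moving from float32 to float16 costs numerically, here is a minimal sketch that rounds a few values to IEEE 754 half precision and measures the relative error. It uses only the Python standard library and does not touch TensorRT; the helper names and the sample values are hypothetical and are not taken from the paper. Python's double-precision floats stand in for the higher-precision reference.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value.

    The struct format code "e" packs a 16-bit float; unpacking it
    gives back the rounded value as an ordinary Python float.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_relative_error(values):
    """Largest relative error introduced by a float16 round trip."""
    return max(abs(to_float16(v) - v) / abs(v) for v in values if v != 0.0)


# Hypothetical activation-like values (assumption, not from the paper).
sample = [0.1, 1.2345, 3.14159, 100.5, 0.00042]

for v in sample:
    print(f"{v!r:>10} -> {to_float16(v)!r}")
print("max relative error:", max_relative_error(sample))
```

For values in float16's normal range the relative rounding error is bounded by 2**-11 (about 0.05%), which is one reason half precision often costs little accuracy. Whether that holds for a given detector has to be measured, as the findings above do.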
Abstract
Vision-based Transformers have found wide application in the perception module of autonomous driving for predicting accurate 3D bounding boxes, owing to their strong capability in modeling long-range dependencies between visual features. However, Transformers, initially designed for language models, have mostly focused on accuracy rather than on the inference-time budget. For a safety-critical system like autonomous driving, real-time inference on the on-board compute is an absolute necessity, which keeps our object detection algorithm under a very tight run-time budget. In this paper, we evaluate a variety of strategies to optimize the inference time of vision-transformer-based object detection methods while keeping a close watch on any performance variations. Our chosen metric for these strategies is accuracy-runtime joint optimization. Moreover, for…
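The accuracy-runtime trade-off described in the abstract can be illustrated with a small calculation. The sketch below is not the paper's metric. It only shows how a relative latency improvement and a relative accuracy drop could be computed for one candidate strategy. The function names and all input numbers are hypothetical, chosen so the result matches the 63% and 3% figures, and it assumes both percentages are relative to the baseline, which the summary does not state.

```python
def relative_change(baseline: float, optimized: float) -> float:
    """Fractional reduction from baseline to optimized (positive means lower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline


def summarize(baseline_ms, optimized_ms, baseline_acc, optimized_acc):
    """Report the latency reduction and accuracy drop of one strategy."""
    return {
        "latency_reduction": relative_change(baseline_ms, optimized_ms),
        "accuracy_drop": relative_change(baseline_acc, optimized_acc),
    }


# Hypothetical measurements (assumption): 100 ms -> 37 ms per frame,
# detection accuracy 0.500 -> 0.485.
report = summarize(100.0, 37.0, 0.500, 0.485)
print(f"latency reduction: {report['latency_reduction']:.0%}")
print(f"accuracy drop:     {report['accuracy_drop']:.0%}")
```

Reporting both numbers side by side for every strategy makes it easy to discard options that buy little speed for a large accuracy loss.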
Taxonomy
Topics: Advanced Neural Network Applications · Industrial Vision Systems and Defect Detection · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Dropout · Dense Connections · Convolution · Adam · Non Maximum Suppression · Layer Normalization · Softmax · Linear Layer · 1x1 Convolution
