TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving
Hossein Hassani, Soodeh Nikan, Abdallah Shami

TL;DR
TinyDrive is a lightweight vision-language model designed for multi-view visual question answering in autonomous driving, employing multiscale encoding and selective token routing to achieve high performance with fewer resources.
Contribution
It introduces a novel multiscale visual encoder and dynamic token prioritization mechanisms, enabling efficient VQA in resource-constrained autonomous vehicles.
Findings
Achieves state-of-the-art performance on the DriveLM benchmark.
Improves BLEU-4 and METEOR scores by 11.1% and 35.4%, respectively.
Uses fewer parameters than existing models.
Abstract
Vision Language Models (VLMs) employed for visual question-answering (VQA) in autonomous driving often require substantial computational resources that pose a challenge for their deployment in resource-constrained vehicles. To address this challenge, we introduce TinyDrive, a lightweight yet effective VLM for multi-view VQA in driving scenarios. Our model comprises two key components: a multiscale vision encoder and a dual-level prioritization mechanism for tokens and sequences. The multiscale encoder facilitates the processing of multi-view images at diverse resolutions through scale injection and cross-scale gating to generate enhanced visual representations. At the token level, we design a token routing mechanism that dynamically selects and processes the most informative tokens based on learned importance scores. At the sequence level, we propose integrating normalized loss,…
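The token-level routing idea in the abstract (score each token, then keep only the most informative ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the scorer (`score_tokens`, a sigmoid over a linear projection), the `keep_ratio` parameter, and the top-k selection rule are all assumptions chosen for illustration, and the weights here are fixed by hand rather than learned.

```python
import math
from typing import List, Sequence, Tuple


def score_tokens(tokens: Sequence[Sequence[float]],
                 w: Sequence[float],
                 b: float = 0.0) -> List[float]:
    """Hypothetical importance scorer: sigmoid of a linear projection.

    In a trained model, w and b would be learned; here they are given.
    """
    scores = []
    for tok in tokens:
        z = sum(x * wi for x, wi in zip(tok, w)) + b
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


def route_tokens(tokens: Sequence[Sequence[float]],
                 scores: Sequence[float],
                 keep_ratio: float = 0.5) -> Tuple[List[int], List[Sequence[float]]]:
    """Keep the highest-scoring tokens (assumed top-k rule), in original order."""
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    # Rank token indices by descending importance score.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so downstream layers see a consistent sequence.
    kept = sorted(ranked[:k])
    return kept, [tokens[i] for i in kept]


# Toy example: four 2-dimensional "visual tokens".
tokens = [[0.1, 0.2], [2.0, 1.5], [-1.0, 0.3], [0.9, 0.8]]
w = [1.0, 1.0]  # illustrative weights, not learned
scores = score_tokens(tokens, w)
kept_idx, kept_tokens = route_tokens(tokens, scores, keep_ratio=0.5)
print(kept_idx)
```

With `keep_ratio=0.5`, only half of the tokens are passed on for further processing, which is the kind of saving that makes selective routing attractive on resource-constrained hardware.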
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
