MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving
Enming Zhang, Xingyuan Dai, Min Huang, Yisheng Lv, Qinghai Miao

TL;DR
MiniDrive introduces a lightweight vision-language framework for autonomous driving that efficiently processes multi-level 2D features, improving on previous models in speed and parameter count while maintaining high accuracy.
Contribution
The paper presents MiniDrive, a novel framework with FE-MoE and DI-Adapter modules that enhance efficiency and multi-image processing in vision-language models for autonomous driving.
Findings
Achieves state-of-the-art performance with only 83M parameters.
Reduces computational cost and improves response efficiency.
Effectively handles multi-camera perception tasks.
Abstract
Vision-language models (VLMs) serve as general-purpose end-to-end models in autonomous driving, performing subtasks such as prediction, planning, and perception through question-and-answer interactions. However, most existing methods rely on computationally expensive visual encoders and large language models (LLMs), making them difficult to deploy in real-world scenarios and real-time applications. Meanwhile, most existing VLMs lack the ability to process multiple images, making it difficult to adapt to multi-camera perception in autonomous driving. To address these issues, we propose a novel framework called MiniDrive, which incorporates our proposed Feature Engineering Mixture of Experts (FE-MoE) module and Dynamic Instruction Adapter (DI-Adapter). The FE-MoE effectively maps 2D features into visual token embeddings before being input into the language model. The DI-Adapter enables…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. **Efficiency**: *MiniDrive* is a lightweight VLM with low FLOPs, suitable for real-time deployment on limited hardware, making it highly practical for autonomous driving.
2. **Dynamic Adaptation**: The *Dynamic Instruction Adapter* enhances cross-modal understanding by adapting visual tokens to user instructions, improving interaction quality in real-world applications.

This paper, though notable, leans more toward an engineering approach than a research-oriented contribution. I identify the following limitations:
1. **Insignificant Training Cost Reduction**: Reducing training cost is not significant. A comparable 4-bit or 8-bit quantized large language model (LLM) with ~7B parameters can also be fine-tuned on a single RTX 4090 GPU using adapters, which limits the novelty in terms of efficiency.
2. **Limited Benchmarking Scope**: The integration of UniRepLKNe

1. Goal-Oriented Design for Autonomous Driving: MiniDrive seeks to address the high resource demands of typical vision-language models, specifically aiming to enable real-time processing in the context of autonomous driving.
2. Efficient Model Parameters and FLOPs: The model claims efficiency in terms of FLOPs and memory usage, potentially supporting multi-instance training on a single GPU, which can be beneficial for applications with limited computational resources.

1. Lack of Real-Time Performance Evaluation: Despite the model’s claim of real-time suitability, there is no specific evaluation of inference time or processing speed, which is critical for applications in autonomous driving. The model’s practical performance remains unproven in real-world settings.
2. Limited Novelty of FE-MoE:
   1. The FE-MoE mechanism in MiniDrive employs a continuous weighted-sum approach across multiple experts, similar to the foundational work on Mixture of Experts by Shazee

1. The research and idea proposed in this paper are very well-motivated, focusing on the efficiency of the VLM-based approach that is critical to deploying the method in practice. The methods and findings in this paper could inspire relevant research in this direction.
2. The paper applies the convolutional UniRepLKNet as the visual encoder, instead of ViT-based encoders, to efficiently encode image inputs from multiple directions and introduces an MoE to enhance the representations. The ablatio

1. The paper argues that the DI-Adapter is one of its main contributions; although it seems effective, as shown in Table 3, the novelty is very limited. The idea of adapting visual representations conditioned on instruction has been extensively considered in VLM literature, e.g., the classic InstructBLIP.
2. The proposed method encodes observations from six directions for DriveLM QA. However, it is unclear to me how the model utilizes and benefits from cross-image information.
   - From Section
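
One of the reviews describes FE-MoE as using a continuous weighted-sum approach across multiple experts. As a rough illustration of that general idea only, and not of the paper's actual module, the sketch below mixes the outputs of every expert using softmax gate weights instead of routing to a single expert. The function names (`softmax`, `soft_moe`), the toy experts, and the linear gate are all hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_moe(feature, experts, gate_weights):
    """Continuous (dense) mixture: every expert runs, outputs are gate-weighted.

    feature:      list of floats (a toy stand-in for a visual feature)
    experts:      list of callables mapping a feature to a same-length list
    gate_weights: one row of floats per expert (hypothetical linear gate)
    """
    # One gate score per expert: dot product of its gate row with the feature.
    scores = [sum(w * x for w, x in zip(row, feature)) for row in gate_weights]
    gates = softmax(scores)
    # Run all experts, then take the weighted sum of their outputs.
    outputs = [expert(feature) for expert in experts]
    mixed = [
        sum(g * out[i] for g, out in zip(gates, outputs))
        for i in range(len(feature))
    ]
    return mixed, gates


# Toy experts (assumptions for illustration, not from the paper).
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
]
# Zero gate weights give every expert the same score, hence equal gates.
gate_weights = [[0.0, 0.0], [0.0, 0.0]]
mixed, gates = soft_moe([1.0, 3.0], experts, gate_weights)
print(gates, mixed)
```

Because every expert contributes to the output, this dense form costs one forward pass per expert; sparse variants keep only the highest-scoring experts to reduce that cost.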
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
Methods: Adapter
