HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving
Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li

TL;DR
This paper introduces HiLM-D, a resource-efficient framework that enhances multimodal large language models with multi-scale high-resolution visual perception for improved autonomous driving risk assessment and planning.
Contribution
HiLM-D integrates a lightweight high-resolution perception stream with temporal reasoning to improve visual understanding in MLLMs for autonomous driving tasks.
Findings
Captioning accuracy improves by 3.7% in BLEU-4
Detection performance improves by 8.7% in mIoU
Effective integration of high-resolution perception in MLLMs
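For readers unfamiliar with the detection metric cited above, the following is a minimal sketch of mean intersection-over-union (mIoU) for bounding boxes. It is not the paper's evaluation code. It assumes one predicted box per ground-truth box, boxes in `(x1, y1, x2, y2)` corner format, and a plain average of per-sample IoU; the names `box_iou` and `mean_iou` are illustrative only.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground-truth boxes."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)


if __name__ == "__main__":
    preds = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 1.0, 1.0)]
    targets = [(0.0, 0.0, 2.0, 2.0), (2.0, 2.0, 3.0, 3.0)]
    print(f"mIoU = {mean_iou(preds, targets):.3f}")
```

Under these assumptions, a perfect match scores 1.0 and a non-overlapping pair scores 0.0, so the two-sample example averages to 0.5.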
Abstract
Recent efforts to use natural language for interpretable driving focus mainly on planning, neglecting perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), which aims at interpretable risk object detection and suggestions for ego car motions. Accurate ROLISP implementation requires extensive reasoning to identify critical traffic objects and infer their intentions, prompting us to explore the capabilities of multimodal large language models (MLLMs). However, the CLIP-ViT vision encoders in existing MLLMs have limited perception performance and struggle to capture essential visual information, e.g., high-resolution, multi-scale, and visual-related inductive biases, which are important for autonomous driving. Addressing these challenges, we introduce HiLM-D, a resource-efficient framework that…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
MethodsFocus
