HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving

Xinpeng Ding; Jianhua Han; Hang Xu; Wei Zhang; Xiaomeng Li

arXiv:2309.05186 · cs.CV · March 25, 2025 · 22 cites

TL;DR

This paper introduces HiLM-D, a resource-efficient framework that enhances multimodal large language models with multi-scale high-resolution visual perception for improved autonomous driving risk assessment and planning.

Contribution

HiLM-D integrates a lightweight high-resolution perception stream with temporal reasoning to improve visual understanding in MLLMs for autonomous driving tasks.

Findings

01

Captioning accuracy improves by 3.7% BLEU-4 over existing MLLMs

02

Detection performance improves by 8.7% mIoU over existing MLLMs

03

Effective integration of high-resolution perception in MLLMs
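
For readers unfamiliar with the detection metric, the sketch below shows one common way to compute a mean intersection-over-union (mIoU) between predicted and ground-truth boxes. It is an illustration only, assuming axis-aligned boxes in `(x1, y1, x2, y2)` format and exactly one predicted box per ground-truth box; the names `box_iou` and `mean_iou` are hypothetical, and the paper's own evaluation protocol may differ.

```python
from typing import Sequence, Tuple

# Assumed box format (not taken from the paper): (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground-truth boxes."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

Under this reading, a reported gain in mIoU means the predicted risk-object boxes overlap the annotated boxes more closely on average.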

Abstract

Recent efforts to use natural language for interpretable driving focus mainly on planning, neglecting perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), which aims at interpretable risk object detection and suggestions for ego-car motion. Accurate ROLISP implementation requires extensive reasoning to identify critical traffic objects and infer their intentions, prompting us to explore the capabilities of multimodal large language models (MLLMs). However, the CLIP-ViT vision encoders in existing MLLMs have limited perception performance and struggle to capture essential visual information, e.g., high-resolution, multi-scale, and vision-related inductive biases, which is important for autonomous driving. To address these challenges, we introduce HiLM-D, a resource-efficient framework that…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Focus