Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving

Hao Zhou; Zhanning Gao; Zhili Chen; Maosheng Ye; Qifeng Chen; Tongyi Cao; Honggang Qi

arXiv:2411.13076 · cs.CV · October 16, 2025

Open Access

TL;DR

The paper introduces the Hints of Prompt (HoP) framework, which enhances visual representations in multimodal large language models for autonomous driving by incorporating affinity, semantic, and question hints to improve understanding of complex driving scenarios.

Contribution

The HoP framework integrates three types of hints (affinity, semantic, and question) to improve the performance of multimodal LLMs in autonomous driving, especially in complex interactions and long-tail cases.

Findings

01. Significantly outperforms previous methods on key metrics

02. Enriches visual representations with limited domain data

03. Adapts faster to driving scenarios

Abstract

In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to accurately represent driving-specific scenarios, particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize instance-level structure by strengthening token-wise connections, Semantic hint to incorporate high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs, and Question hint to align visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, enriching visual representations by capturing driving-related representations with limited domain data, ensuring faster…
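As a rough illustration of two of the ideas named in the abstract, the sketch below builds a token-wise affinity matrix and a question-relevance weighting over toy visual tokens, then combines them. Everything here is an assumption made for illustration: the function names, the use of cosine similarity and softmax, and the combination rule are hypothetical and are not taken from the paper. The semantic hint is omitted because it depends on a pretrained encoder that the abstract does not describe.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def affinity_matrix(tokens):
    """Hypothetical affinity hint: pairwise cosine similarity between visual tokens."""
    return [[cosine(a, b) for b in tokens] for a in tokens]


def question_weights(tokens, question):
    """Hypothetical question hint: softmax relevance of each token to the question vector."""
    return softmax([cosine(t, question) for t in tokens])


def fuse(tokens, question):
    """Illustrative fusion (an assumption, not the paper's Hint Fusion module).

    Each token is added to an affinity-weighted average of all tokens,
    then scaled by its question-relevance weight.
    """
    aff = affinity_matrix(tokens)
    weights = question_weights(tokens, question)
    dim = len(tokens[0])
    fused = []
    for i, tok in enumerate(tokens):
        row = softmax(aff[i])  # normalise affinities for token i
        context = [sum(row[j] * tokens[j][d] for j in range(len(tokens)))
                   for d in range(dim)]
        fused.append([weights[i] * (tok[d] + context[d]) for d in range(dim)])
    return fused


# Toy example: three 2-D "visual tokens" and one 2-D "question" embedding.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_question = [1.0, 0.0]
toy_fused = fuse(toy_tokens, toy_question)
```

In this toy setup the token most aligned with the question vector receives the largest weight, which mirrors the stated goal of focusing on question-relevant regions. A real system would operate on learned embeddings from a vision encoder and a language model.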

Taxonomy

Topics: Speech and dialogue systems · Semantic Web and Ontologies

Methods: ALIGN · Hierarchical Information Threading · Contrastive Language-Image Pre-training