See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
Ding Xia, Xinyue Gui, Mark Colley, Fan Gao, Zhongyi Zhou, Dongyuan Li, Renhe Jiang, Takeo Igarashi

TL;DR
See2Refine is a framework that uses vision-language models to automatically evaluate and improve LLM-generated eHMI actions for automated vehicles, enhancing communication without human supervision.
Contribution
It introduces a human-free, closed-loop system that refines LLM-based eHMI actions using automated visual feedback from VLMs, outperforming prompt-only and baseline methods.
Findings
The framework improves eHMI action appropriateness across modalities.
VLM evaluations align well with human preferences.
Refinement generalizes across different LLM sizes and modalities.
Abstract
Automated vehicles lack natural communication channels with other road users, making external Human-Machine Interfaces (eHMIs) essential for conveying intent and maintaining trust in shared environments. However, most eHMI studies rely on developer-crafted message-action pairs, which are difficult to adapt to diverse and dynamic traffic contexts. A promising alternative is to use Large Language Models (LLMs) as action designers that generate context-conditioned eHMI actions, yet such designers lack perceptual verification and typically depend on fixed prompts or costly human-annotated feedback for improvement. We present See2Refine, a human-free, closed-loop framework that uses vision-language model (VLM) perceptual evaluation as automated visual feedback to improve an LLM-based eHMI action designer. Given a driving context and a candidate eHMI action, the VLM evaluates the perceived…
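The closed loop the abstract describes (a designer proposes an eHMI action, an evaluator judges it, and the judgment feeds back into the designer) might be pictured roughly as follows. This is a generic propose-evaluate-refine sketch, not the authors' method: the abstract is cut off before the refinement mechanism is described, so treating feedback as text passed back at inference time is an assumption. All names (`refine_loop`, `toy_designer`, `toy_evaluator`, `Candidate`) are hypothetical, and the two toy functions are deterministic stand-ins for the LLM designer and the VLM evaluator.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    action: str   # textual description of an eHMI action
    score: float  # evaluator's rating; higher means more appropriate


def refine_loop(
    context: str,
    propose: Callable[[str, str], str],
    evaluate: Callable[[str, str], Tuple[float, str]],
    rounds: int = 3,
) -> Tuple[Candidate, List[Candidate]]:
    """Run a generic propose-evaluate-refine loop (assumed structure)."""
    feedback = ""  # no feedback is available before the first proposal
    history: List[Candidate] = []
    for _ in range(rounds):
        action = propose(context, feedback)
        score, critique = evaluate(context, action)
        history.append(Candidate(action, score))
        feedback = critique  # the critique conditions the next proposal
    best = max(history, key=lambda c: c.score)
    return best, history


def toy_designer(context: str, feedback: str) -> str:
    # Hypothetical stand-in for an LLM-based action designer.
    base = "light bar pulse"
    return base if not feedback else f"{base} + {feedback}"


def toy_evaluator(context: str, action: str) -> Tuple[float, str]:
    # Hypothetical stand-in for a VLM evaluator: rewards actions that
    # incorporated earlier feedback and always returns the same critique.
    score = 1.0 + action.count("+")
    return score, "slower rhythm"


best, history = refine_loop(
    "pedestrian waiting at crosswalk", toy_designer, toy_evaluator, rounds=3
)
print(best.action, best.score)
```

In a real system the two callables would wrap model calls, and the evaluator would presumably look at a rendering of the action rather than its text. Those details are not recoverable from the truncated abstract.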
