Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI

Benjamin Raphael Ernhofer; Daniil Prokhorov; Jannica Langner; Dominik Bollmann

arXiv:2505.05895 · cs.CV · August 6, 2025

Open Access · 1 Repo · 1 Model · 1 Dataset

TL;DR

This paper presents a vision-language framework and dataset for understanding and interacting with automotive UIs, demonstrating strong cross-domain performance and cost-effective deployment of fine-tuned models.

Contribution

Introduces the AutomotiveUI-Bench-4K dataset and ELAM, a fine-tuned vision-language model for automotive UI understanding.

Findings

01

Achieved 80.8% accuracy on ScreenSpot benchmark.

02

Demonstrated +5.6% improvement over baseline on ScreenSpot.

03

Model and dataset are publicly available on Hugging Face.
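
The ScreenSpot figures above are grounding accuracies. A minimal sketch of how such a score is commonly computed, assuming the usual ScreenSpot criterion that a prediction counts as correct when the predicted click point falls inside the ground-truth bounding box of the target element. The function names and the `(x_min, y_min, x_max, y_max)` box layout are illustrative assumptions, not taken from the paper's code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges inclusive)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def grounding_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their target boxes."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(points, boxes))
    return hits / len(points)


# Hypothetical predictions: the first lands inside its box, the second does not.
predictions = [(5.0, 5.0), (50.0, 50.0)]
targets = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
accuracy = grounding_accuracy(predictions, targets)
```

Point-based scoring of this kind suits models that output a click location rather than a box, which is why the metric checks containment instead of box overlap.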

Abstract

Modern automotive infotainment systems necessitate intelligent and adaptive solutions to manage frequent User Interface (UI) updates and diverse design variations. This work introduces a vision-language framework to facilitate the understanding of and interaction with automotive UIs, enabling seamless adaptation across different UI designs. To support research in this field, AutomotiveUI-Bench-4K, an open-source dataset comprising 998 images with 4,208 annotations, is also released. Additionally, a data pipeline for generating training data is presented. A Molmo-7B-based model is fine-tuned using Low-Rank Adaptation (LoRA), incorporating generated reasoning along with visual grounding and evaluation capabilities. The fine-tuned Evaluative Large Action Model (ELAM) achieves strong performance on AutomotiveUI-Bench-4K (model and dataset are available on Hugging Face). The approach…
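
The abstract names Low-Rank Adaptation as the fine-tuning method. As a rough illustration of the general technique only, the sketch below applies the standard LoRA update W' = W + (alpha / r) * B A to a tiny weight matrix and compares trainable parameter counts. The rank, alpha, layer sizes, and function names are hypothetical; this page does not state the configuration used for ELAM.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # d_out x d_in low-rank update
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in A and B combined; the frozen base weight W is excluded."""
    return rank * (d_in + d_out)


# Hypothetical 2 x 2 example with rank 1 and alpha 2.
base = [[1.0, 0.0], [0.0, 1.0]]
a_matrix = [[1.0, 2.0]]      # r x d_in
b_matrix = [[1.0], [0.0]]    # d_out x r
adapted = lora_effective_weight(base, a_matrix, b_matrix, alpha=2.0)

# Hypothetical 4096 x 4096 layer at rank 8: adapter size versus full weight size.
adapter_params = lora_trainable_params(4096, 4096, 8)
full_params = 4096 * 4096
```

Only A and B are trained while W stays frozen, so for the illustrative layer above the adapter holds a small fraction of the parameters of the full weight matrix.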

Code & Models

Repositories

https://huggingface.co/sparks-solutions/ELAM-7B
Official

Models

sparks-solutions/ELAM-7B
model · 154 downloads

Datasets

sparks-solutions/AutomotiveUI-Bench-4K
dataset · 454 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Persona Design and Applications