Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving
Nikos Theodoridis, Reenu Mohandas, Ganesh Sistu, Anthony Scanlan, Ciarán Eising, Tim Brophy

TL;DR
This study probes the intermediate activations of vision-language models (VLMs) to determine how linearly they encode visual concepts relevant to automated driving, and where the flow of visual information breaks down.
Contribution
The paper introduces a method to analyze the linear encoding of visual concepts in VLMs and identifies specific failure modes affecting their performance in automated driving scenarios.
Findings
Object presence is explicitly linearly encoded in VLMs.
Spatial concepts like orientation are implicitly encoded.
Increasing object distance reduces linear separability of concepts.
Abstract
The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long-tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene…
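As a rough illustration of the linear-probing idea described in the abstract, the sketch below fits a logistic-regression probe to two classes of feature vectors and reports held-out accuracy. It is not the authors' code: the activations are synthetic Gaussian vectors standing in for VLM activations of a counterfactual image set, and every name here (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`, `run_probe`, and the `shift` parameter) is hypothetical. The `shift` value is only a toy knob for how far apart the two classes sit along one direction.

```python
import numpy as np


def make_synthetic_activations(n_per_class=200, dim=64, shift=2.0, seed=0):
    """Hypothetical stand-in for activations of a counterfactual image set.

    Class 1 is class 0 displaced by `shift` along one random unit direction,
    mimicking a concept that is (or is not) linearly encoded.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    y = np.repeat([0, 1], n_per_class)
    X = rng.normal(size=(2 * n_per_class, dim)) + shift * y[:, None] * direction
    return X, y


def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(class 1)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of samples on the correct side of the probe's hyperplane."""
    return float(np.mean(((X @ w + b) > 0).astype(int) == y))


def run_probe(shift, seed=0):
    """Train on 80% of the synthetic data and score on the held-out 20%."""
    X, y = make_synthetic_activations(shift=shift, seed=seed)
    idx = np.random.default_rng(seed + 1).permutation(len(y))
    split = int(0.8 * len(y))
    train, test = idx[:split], idx[split:]
    w, b = train_linear_probe(X[train], y[train])
    return probe_accuracy(w, b, X[test], y[test])


if __name__ == "__main__":
    for shift in (0.0, 2.0, 6.0):
        print(f"shift={shift}: held-out probe accuracy = {run_probe(shift):.2f}")
```

Held-out accuracy near chance suggests the probed concept is not linearly recoverable from the features, while high accuracy suggests it is. With real models, the synthetic generator would be replaced by activations extracted at a chosen layer for each image in the counterfactual set.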
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection
