TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning

Lihong Chen; Hossein Hassani; Soodeh Nikan

arXiv:2505.12670·cs.CV·May 20, 2025

Open Access

TL;DR

This paper introduces TS-VLM, a lightweight vision-language model with a novel Text-Guided SoftSort Pooling module that improves multi-view reasoning accuracy and efficiency for autonomous driving.

Contribution

The paper proposes a new query-aware pooling method that enhances multi-view fusion in vision-language models, reducing computational costs significantly.

Findings

1. Outperforms state-of-the-art models on the DriveLM benchmark.

2. Reduces computational cost by up to 90%.

3. The smallest version contains only 20.1 million parameters.

Abstract

Vision-Language Models (VLMs) have shown remarkable potential for advancing autonomous driving by using multi-modal fusion to enhance scene perception, reasoning, and decision-making. However, existing models suffer from computational overhead and inefficient integration of multi-view sensor data, which makes them impractical for real-time deployment in safety-critical autonomous driving applications. To address these shortcomings, this paper presents a lightweight VLM called TS-VLM, which incorporates a novel Text-Guided SoftSort Pooling (TGSSP) module. Using the semantics of the input queries, TGSSP ranks and fuses visual features from multiple views, enabling dynamic, query-aware multi-view aggregation without relying on costly attention mechanisms. This design ensures the query-adaptive prioritization of semantically related views,…
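The ranking-and-fusion idea described in the abstract can be illustrated with a minimal sketch. It uses the standard SoftSort relaxation (a row-wise softmax over the negative absolute differences between sorted and unsorted scores). Everything else is an assumption made for illustration and is not taken from the paper: the cosine-similarity scoring, the one-vector-per-view input, the rank weighting, the temperature `tau`, and all function names.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def softsort(scores, tau=1.0):
    """SoftSort relaxation of descending sort.

    Row i of the result is a soft one-hot vector that concentrates on the
    item holding the i-th largest score as tau approaches 0.
    """
    ordered = sorted(scores, reverse=True)
    return [softmax([-abs(s_sorted - s) / tau for s in scores])
            for s_sorted in ordered]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b + 1e-8)


def text_guided_pool(view_feats, text_emb, tau=0.1):
    """Hypothetical text-guided pooling over camera views.

    view_feats: list of V vectors, one pooled feature per view (assumption).
    text_emb:   query embedding with the same dimension (assumption).
    Returns the fused vector and the V x V soft permutation matrix.
    """
    num_views = len(view_feats)
    dim = len(text_emb)

    # Score each view by its similarity to the text query (assumed scoring).
    scores = [cosine(v, text_emb) for v in view_feats]

    # Soft permutation that orders views from most to least relevant.
    perm = softsort(scores, tau)

    # Soft-sorted views: row i is a convex combination of the views,
    # dominated by the i-th most relevant one.
    ranked = [[sum(perm[i][j] * view_feats[j][d] for j in range(num_views))
               for d in range(dim)]
              for i in range(num_views)]

    # Assumed rank weighting: softmax over the sorted scores.
    rank_w = softmax([s / tau for s in sorted(scores, reverse=True)])

    fused = [sum(rank_w[i] * ranked[i][d] for i in range(num_views))
             for d in range(dim)]
    return fused, perm
```

Calling `text_guided_pool` with three toy view vectors and a query aligned with the first one yields a fused vector dominated by that view. No pairwise token-level attention is computed in this sketch: the cost is one similarity per view plus a V × V softmax, which is the kind of saving the abstract alludes to.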

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety

Methods: Softmax · Attention Is All You Need