LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement

Siwen Jiao; Yangyi Fang; Baoyun Peng; Wangqun Chen; Bharadwaj Veeravalli

arXiv:2411.12980 · cs.CV · February 25, 2025

Open Access

TL;DR

LaVida Drive is a novel VQA framework for autonomous driving that efficiently integrates high-resolution spatial data with temporal information, improving perception accuracy and computational efficiency in dynamic environments.

Contribution

LaVida Drive introduces a dual-module approach, combining token selection with token recovery, to improve visual question answering in autonomous driving. It targets a limitation of existing methods, which work on static images or downsampled inputs.
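To make the idea of token selection followed by recovery more concrete, here is a minimal generic sketch. It is not the paper's implementation: the function names (`select_tokens`, `recover_context`), the cosine-similarity scoring, and the mean-pooled summary token are all assumptions chosen for illustration, since this page does not describe the actual modules.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tokens(visual_tokens, query, k):
    """Hypothetical selection step: keep the k visual tokens most similar to
    the text query embedding, returned in their original order."""
    scores = [cosine(t, query) for t in visual_tokens]
    top = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)
    return kept, [visual_tokens[i] for i in kept]


def recover_context(visual_tokens, kept):
    """Hypothetical recovery step: mean-pool the dropped tokens into a single
    summary token so some global context survives the pruning."""
    kept_set = set(kept)
    dropped = [t for i, t in enumerate(visual_tokens) if i not in kept_set]
    if not dropped:
        return None
    dim = len(dropped[0])
    return [sum(t[d] for t in dropped) / len(dropped) for d in range(dim)]


# Toy 2-D "embeddings"; real visual tokens would be high-dimensional.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]

kept, kept_tokens = select_tokens(tokens, query, k=2)
summary = recover_context(tokens, kept)
print(kept)     # [0, 2]
print(summary)  # [-0.5, 0.5]
```

In this toy run the four visual tokens are cut to two query-relevant tokens plus one pooled summary token. That is the general trade-off such a design aims for: fewer tokens for the language model to process, without discarding the surrounding context entirely.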

Findings

01. Reduces visual tokens significantly
02. Improves efficiency in processing
03. Enhances performance on driving benchmarks

Abstract

Recent advancements in Visual Language Models (VLMs) have made them crucial for visual question answering (VQA) in autonomous driving, enabling natural human-vehicle interactions. However, existing methods often struggle in dynamic driving environments, as they usually focus on static images or videos and rely on downsampling to manage computational costs. This results in the loss of critical details and the difficulty in effectively integrating spatial and temporal information, undermining fine-grained perception and temporal coherence essential for effective decision-making. To tackle these challenges, we introduce LaVida Drive, a novel and efficient VQA framework for autonomous driving. LaVida Drive seamlessly integrates temporal data while maintaining high-resolution inputs for detailed visual perception. It optimizes spatial processing by retaining high-resolution data for…

Taxonomy

Topics: Robotic Path Planning Algorithms · Web Data Mining and Analysis

Methods: Focus