Efficient Visual Question Answering Pipeline for Autonomous Driving via Scene Region Compression
Yuliang Cai, Dongqiangzi Ye, Zitian Chen, Chongruo Wu

TL;DR
This paper introduces SRC-Pipeline, an efficient vision-language model for autonomous driving VQA that compresses scene tokens to cut FLOPs by 66% while keeping VQA performance comparable, enabling real-time processing.
Contribution
The paper presents a novel token compression framework for VQA in autonomous driving, reducing FLOPs by 66% while maintaining performance, facilitating real-time deployment.
Findings
Achieves 66% reduction in FLOPs.
Maintains comparable VQA performance.
Enables real-time autonomous driving VQA.
Abstract
Autonomous driving increasingly relies on Visual Question Answering (VQA) to enable vehicles to understand complex surroundings by analyzing visual inputs and textual queries. A paramount concern for VQA in this domain is the stringent requirement for low latency and real-time processing, as delays directly affect safety in this safety-critical application. However, current state-of-the-art VQA models, particularly large vision-language models (VLMs), often prioritize performance over computational efficiency. These models typically process dense patch tokens for every frame, leading to prohibitive computational costs (FLOPs) and significant inference latency, especially with long video sequences. This limits their practical deployment in real-time autonomous driving scenarios. To tackle this issue, we propose an efficient VLM framework for autonomous…
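To illustrate why the number of dense patch tokens drives cost, the sketch below estimates the forward FLOPs of one standard transformer layer as a function of token count. It is not the paper's method or its FLOPs accounting: the function names, the 4x MLP expansion, and the convention of counting a multiply-add as 2 FLOPs are assumptions made for illustration, and the token counts in the usage lines are hypothetical.

```python
def layer_flops(num_tokens: int, hidden_dim: int) -> int:
    """Approximate forward FLOPs of one standard transformer layer.

    Assumptions (not taken from the paper): a multiply-add counts as
    2 FLOPs and the MLP uses a 4x hidden expansion. This gives
    24 * n * d^2 for the linear projections and MLP, plus
    4 * n^2 * d for the attention score and value products.
    """
    linear = 24 * num_tokens * hidden_dim ** 2
    attention = 4 * num_tokens ** 2 * hidden_dim
    return linear + attention


def flops_reduction(dense_tokens: int, kept_tokens: int, hidden_dim: int) -> float:
    """Fraction of per-layer FLOPs saved by keeping only `kept_tokens`."""
    dense = layer_flops(dense_tokens, hidden_dim)
    compressed = layer_flops(kept_tokens, hidden_dim)
    return 1.0 - compressed / dense


if __name__ == "__main__":
    # Hypothetical setting: 2048 dense scene tokens compressed to 683.
    saved = flops_reduction(dense_tokens=2048, kept_tokens=683, hidden_dim=4096)
    print(f"Estimated per-layer FLOPs saved: {saved:.1%}")
```

Because the attention term grows quadratically with the token count, the saving under this estimate is always somewhat larger than the fraction of tokens removed; the linear term dominates when the hidden dimension is large relative to the sequence length.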
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
