Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving
Yunsheng Ma, Amr Abdelraouf, Rohit Gupta, Ziran Wang, Kyungtae Han

TL;DR
This paper introduces Video Token Sparsification (VTS), a method that reduces visual token redundancy in multimodal large language models for autonomous driving, improving inference efficiency while maintaining performance on the DRAMA and LingoQA benchmarks.
Contribution
VTS adaptively prunes visual tokens using a lightweight CNN-based proposal model, raising inference throughput by up to 33% and cutting memory usage by 28% in multimodal LLMs for autonomous driving.
Findings
Up to 33% increase in inference throughput
28% reduction in memory usage
Maintains performance on DRAMA and LingoQA benchmarks
Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems through powerful logical reasoning capabilities. However, the deployment of these models faces significant challenges due to their substantial parameter sizes and computational demands, which often exceed the constraints of onboard computation. One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information, leading to increased latency and memory consumption. To address this issue, we propose Video Token Sparsification (VTS), a novel approach that leverages the inherent redundancy in consecutive video frames to significantly reduce the total number of visual tokens while preserving the most salient information. VTS employs a lightweight CNN-based proposal model to…
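The pruning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it explains how the CNN-based proposal model scores tokens, so the sketch substitutes a simple stand-in score (the L1 change of each token relative to the same position in the previous frame). The function names `token_change` and `sparsify_video_tokens`, the fixed `keep_ratio` parameter, and the choice to keep the first frame in full are all assumptions made for illustration. VTS is described as adaptive, whereas this sketch uses a fixed ratio for simplicity.

```python
import math
from typing import List, Sequence

Token = Sequence[float]   # one visual token embedding
Frame = List[Token]       # all tokens of one video frame


def token_change(curr: Token, prev: Token) -> float:
    """Stand-in saliency score: L1 distance to the same-position token
    in the previous frame (an assumption, not the paper's CNN scorer)."""
    return sum(abs(c - p) for c, p in zip(curr, prev))


def sparsify_video_tokens(frames: List[Frame], keep_ratio: float) -> List[List[int]]:
    """Return, for each frame, the sorted indices of the tokens to keep.

    The first frame is kept in full (assumption). Each later frame keeps
    the top `keep_ratio` fraction of tokens, ranked by how much they
    changed since the previous frame.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    kept: List[List[int]] = []
    for t, frame in enumerate(frames):
        n = len(frame)
        if t == 0:
            kept.append(list(range(n)))
            continue
        scores = [token_change(tok, prev) for tok, prev in zip(frame, frames[t - 1])]
        k = max(1, math.ceil(keep_ratio * n))
        top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
        kept.append(sorted(top))  # restore spatial order of surviving tokens
    return kept


# Two toy frames with four 2-dimensional tokens each (hypothetical data).
frames = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[0.0, 0.0], [1.0, 1.5], [2.0, 2.0], [5.0, 3.0]],
]
kept = sparsify_video_tokens(frames, keep_ratio=0.5)
# kept == [[0, 1, 2, 3], [1, 3]]
```

In this toy input, tokens 0 and 2 of the second frame are identical to their counterparts in the first frame, so they are dropped as redundant and the total token count falls from 8 to 6. A learned scorer would replace `token_change`, but the keep-top-k selection step would look similar.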
Taxonomy
Topics: Advanced Steganography and Watermarking Techniques · Generative Adversarial Networks and Image Synthesis · Image and Video Stabilization
