Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

Yunsheng Ma; Amr Abdelraouf; Rohit Gupta; Ziran Wang; Kyungtae Han

arXiv:2409.11182·cs.CV·September 18, 2024

TL;DR

This paper introduces Video Token Sparsification (VTS), a method that reduces visual token redundancy in multimodal large language models for autonomous driving, improving inference efficiency while maintaining performance on the DRAMA and LingoQA benchmarks.

Contribution

VTS adaptively prunes visual tokens using a lightweight CNN-based proposal model, raising inference throughput by up to 33% and reducing memory usage by 28% in multimodal LLMs for autonomous driving.

Findings

01. Up to 33% increase in inference throughput

02. 28% reduction in memory usage

03. Maintains performance on DRAMA and LingoQA benchmarks

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems through powerful logical reasoning capabilities. However, the deployment of these models faces significant challenges due to their substantial parameter sizes and computational demands, which often exceed the constraints of onboard computation. One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information, leading to increased latency and memory consumption. To address this issue, we propose Video Token Sparsification (VTS), a novel approach that leverages the inherent redundancy in consecutive video frames to significantly reduce the total number of visual tokens while preserving the most salient information. VTS employs a lightweight CNN-based proposal model to…
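
To make the idea of token sparsification concrete, the following is a minimal sketch of score-based pruning in plain Python. It is not the authors' implementation: the abstract is truncated before it describes how the proposal model works, so the per-token saliency scores here are simply assumed to be given, and the function name `prune_tokens` and the `keep_ratio` parameter are hypothetical names introduced for illustration.

```python
import math
from typing import List, Sequence


def prune_tokens(
    tokens: Sequence[str],
    scores: Sequence[float],
    keep_ratio: float,
) -> List[str]:
    """Keep the highest-scoring tokens, preserving their original order.

    `scores` stands in for the output of some saliency model (an assumption;
    the source does not specify how scores are produced).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Number of tokens to keep, rounded up.
    k = math.ceil(len(tokens) * keep_ratio)

    # Rank token indices by score (highest first) and take the top k.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:k])  # restore the original token order

    return [tokens[i] for i in kept_indices]


# Hypothetical example: six visual tokens with made-up saliency scores.
tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
print(prune_tokens(tokens, scores, keep_ratio=0.5))  # ['t1', 't3', 't5']
```

Keeping the surviving tokens in their original order is a design choice of this sketch, made so that any positional structure in the token sequence stays intact for the downstream language model.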

Taxonomy

Topics: Advanced Steganography and Watermarking Techniques · Generative Adversarial Networks and Image Synthesis · Image and Video Stabilization