Extreme Model Compression for Edge Vision-Language Models: Sparse Temporal Token Fusion and Adaptive Neural Compression

Md Tasnin Tanvir; Soumitra Das; Sk Md Abidar Rahaman; Ali Shiri Sichani

arXiv:2511.18504·cs.CV·November 25, 2025

Open Access

TL;DR

This paper introduces two adaptive compression methods for edge vision-language models, Sparse Temporal Token Fusion (STTF) and Adaptive Neural Compression (ANC). They cut on-device FLOPs by up to 62x while still surpassing LLaVA-1.5 7B on COCO captioning, targeting real-time deployment on resource-limited devices.

Contribution

It proposes Sparse Temporal Token Fusion and Adaptive Neural Compression, novel techniques that dynamically adapt model complexity based on scene content for efficient edge deployment.
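As a rough illustration of adapting model complexity to scene content, the sketch below routes an input to a light or a heavy encoder branch, in the spirit of the learned router that the abstract attributes to ANC. Everything in it is an assumption made for illustration: the two scene features, the hand-set weights `ROUTER_WEIGHTS` and `ROUTER_BIASES` (standing in for trained parameters), and the toy branches `light_branch` and `heavy_branch`. It is not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a small list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(features: Sequence[float],
          weights: Sequence[Sequence[float]],
          biases: Sequence[float]) -> Tuple[int, List[float]]:
    """Linear router: one logit per branch, pick the most probable branch."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical encoder branches: the heavy one emits a richer representation.
def light_branch(x: Sequence[float]) -> List[float]:
    return [sum(x) / len(x)]


def heavy_branch(x: Sequence[float]) -> List[float]:
    return [sum(x) / len(x), max(x), min(x)]


BRANCHES: List[Callable[[Sequence[float]], List[float]]] = [light_branch, heavy_branch]

# Assumed placeholder parameters; a real router would learn these.
# Features are (edge_density, brightness); only edge density drives the choice.
ROUTER_WEIGHTS = [[-4.0, 0.0], [4.0, 0.0]]
ROUTER_BIASES = [1.0, -1.0]


def encode(features: Sequence[float]) -> Tuple[int, List[float]]:
    """Run only the branch selected by the router."""
    branch_index, _ = route(features, ROUTER_WEIGHTS, ROUTER_BIASES)
    return branch_index, BRANCHES[branch_index](features)


simple_branch, simple_code = encode([0.1, 0.5])    # low edge density
complex_branch, complex_code = encode([0.8, 0.5])  # high edge density
print(simple_branch, len(simple_code))
print(complex_branch, len(complex_code))
```

Only the selected branch executes, which is where the compute saving would come from in a conditional design like this; how the paper trains the router or balances the branches is not stated in this summary.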

Findings

01 The 3B-parameter TinyGPT-STTF reaches CIDEr 131.2 on the COCO 2017 test set, 17.6 points above LLaVA-1.5 7B, with 2.3x fewer parameters.

02 Reduced on-device FLOPs by up to 62x relative to LLaVA-1.5 7B.

03 Decreased token count by 84% in event-based vision tasks.
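The headline ratios can be sanity-checked with simple arithmetic. The sketch below uses only figures quoted on this page (3B and 7B parameters, CIDEr 131.2, a 17.6-point margin, an 84% token reduction). The derived values, including the implied baseline score, are computed here and are not reported in this summary.

```python
# Figures quoted on this page.
tiny_params_billion = 3.0
llava_params_billion = 7.0
tiny_cider = 131.2
cider_margin = 17.6
token_reduction = 0.84

# Derived quantities (our arithmetic, not values reported by the paper).
param_ratio = llava_params_billion / tiny_params_billion
implied_baseline_cider = tiny_cider - cider_margin
tokens_remaining = 1.0 - token_reduction
token_shrink_factor = 1.0 / tokens_remaining

print(f"parameter ratio: {param_ratio:.2f}x")
print(f"implied LLaVA-1.5 7B CIDEr: {implied_baseline_cider:.1f}")
print(f"tokens remaining: {tokens_remaining:.0%} (about {token_shrink_factor:.2f}x fewer)")
```

The 7B-to-3B ratio rounds to the quoted 2.3x, and an 84% reduction leaves roughly one token in six.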

Abstract

The demand for edge AI in vision-language tasks requires models that achieve real-time performance on resource-constrained devices with limited power and memory. This paper proposes two adaptive compression techniques -- Sparse Temporal Token Fusion (STTF) and Adaptive Neural Compression (ANC) -- that integrate algorithmic innovations with hardware-aware optimizations. Unlike previous approaches relying on static pruning or uniform scaling, STTF dynamically reuses visual tokens through event-driven change detection, while ANC conditionally activates encoder branches via a learned router, enabling fine-grained adaptation to scene complexity. Our 3B-parameter TinyGPT-STTF achieves CIDEr 131.2, BLEU-4 0.38, METEOR 0.31, and ROUGE-L 0.56 on the COCO 2017 test set, surpassing LLaVA-1.5 7B by 17.6 CIDEr points while using 2.3x fewer parameters and 62x fewer on-device FLOPs. TinyGPT-ANC…
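The token-reuse idea behind STTF can be illustrated with a minimal sketch: a patch is re-encoded only when it has changed enough since the previous frame, and otherwise its cached token is reused. This is an assumption-laden toy, not the authors' method. The mean-absolute-difference change detector, the `threshold` value, and the stand-in `encode_patch` function are all hypothetical.

```python
from typing import List, Sequence, Tuple

Token = List[float]


def mean_abs_change(prev_patch: Sequence[float], curr_patch: Sequence[float]) -> float:
    """Average absolute per-pixel difference between two patches."""
    return sum(abs(a - b) for a, b in zip(prev_patch, curr_patch)) / len(curr_patch)


def encode_patch(patch: Sequence[float]) -> Token:
    """Hypothetical stand-in for an expensive visual encoder."""
    return [sum(patch) / len(patch), max(patch) - min(patch)]


def fuse_tokens(prev_patches: Sequence[Sequence[float]],
                curr_patches: Sequence[Sequence[float]],
                cached_tokens: Sequence[Token],
                threshold: float = 0.1) -> Tuple[List[Token], int]:
    """Reuse cached tokens for unchanged patches; re-encode changed ones.

    Returns the token list for the current frame and the number of
    patches that had to be re-encoded.
    """
    tokens: List[Token] = []
    recomputed = 0
    for prev, curr, cached in zip(prev_patches, curr_patches, cached_tokens):
        if mean_abs_change(prev, curr) > threshold:
            tokens.append(encode_patch(curr))  # change detected: pay for encoding
            recomputed += 1
        else:
            tokens.append(cached)              # static region: reuse old token
    return tokens, recomputed


# Two frames of three 4-pixel patches; only the middle patch changes.
prev_frame = [[0.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]]
curr_frame = [[0.0, 0.0, 0.0, 0.0], [0.5, 0.9, 0.5, 0.9], [1.0, 1.0, 1.0, 1.0]]
cached = [encode_patch(p) for p in prev_frame]

tokens, recomputed = fuse_tokens(prev_frame, curr_frame, cached)
print(f"re-encoded {recomputed} of {len(curr_frame)} patches")
```

In a mostly static scene most patches fall below the threshold, so encoder work scales with the amount of change rather than with frame size. The threshold trades accuracy against compute, and the value used here is arbitrary.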

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques