Extreme Model Compression for Edge Vision-Language Models: Sparse Temporal Token Fusion and Adaptive Neural Compression
Md Tasnin Tanvir, Soumitra Das, Sk Md Abidar Rahaman, Ali Shiri Sichani

TL;DR
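This paper introduces two adaptive compression methods for edge vision-language models, Sparse Temporal Token Fusion (STTF) and Adaptive Neural Compression (ANC). The 3B-parameter TinyGPT-STTF uses 62x fewer on-device FLOPs than LLaVA-1.5 7B while scoring 17.6 CIDEr points higher on COCO 2017, targeting real-time deployment on resource-limited devices.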
Contribution
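The paper proposes two techniques that adapt model complexity to scene content for efficient edge deployment: Sparse Temporal Token Fusion (STTF), which reuses visual tokens based on event-driven change detection, and Adaptive Neural Compression (ANC), which conditionally activates encoder branches through a learned router.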
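As a rough illustration of conditional branch activation by a router, the following sketch gates a set of encoder branches on a per-input score and runs only the branches whose gate is open. It is not the paper's implementation: the `Router` class, the sigmoid gate, the 0.5 threshold, the hand-set weights, and the toy branches are all assumptions made for this example (in the paper the router is learned).

```python
import math
from typing import Callable, List, Sequence, Tuple

Features = Sequence[float]
Branch = Callable[[Features], List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class Router:
    """Hypothetical router: one linear score per branch, squashed to a gate in (0, 1)."""

    def __init__(self, weights: List[List[float]], biases: List[float]):
        self.weights = weights  # one weight row per branch (hand-set here, learned in practice)
        self.biases = biases

    def gates(self, feats: Features) -> List[float]:
        return [
            sigmoid(sum(w * f for w, f in zip(row, feats)) + b)
            for row, b in zip(self.weights, self.biases)
        ]


def run_encoder(
    feats: Features,
    branches: List[Branch],
    router: Router,
    gate_threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[List[float], List[int]]:
    gates = router.gates(feats)
    active = [i for i, g in enumerate(gates) if g > gate_threshold]
    if not active:
        # Always run at least the highest-scoring branch.
        active = [max(range(len(gates)), key=gates.__getitem__)]
    output = [0.0] * len(feats)
    for i in active:
        branch_out = branches[i](feats)  # skipped branches cost nothing
        output = [o + gates[i] * b for o, b in zip(output, branch_out)]
    return output, active


# Toy setup: branch 0 is cheap and nearly always on; branch 1 is heavier and
# opens only when the single "scene complexity" feature is high.
branches: List[Branch] = [lambda f: list(f), lambda f: [2.0 * x for x in f]]
router = Router(weights=[[0.0], [10.0]], biases=[2.0, -5.0])

_, active_simple = run_encoder([0.1], branches, router)
_, active_complex = run_encoder([0.9], branches, router)
print(active_simple, active_complex)
```

In this toy configuration a low-complexity input runs only the cheap branch, while a high-complexity input runs both, which is the kind of compute saving that conditional activation is meant to provide.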
Findings
TinyGPT-STTF (3B parameters) scores CIDEr 131.2 on the COCO 2017 test set, 17.6 points above LLaVA-1.5 7B, with 2.3x fewer parameters.
TinyGPT-STTF uses 62x fewer on-device FLOPs than LLaVA-1.5 7B.
Decreased token count by 84% in event-based vision tasks.
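The headline ratios can be sanity-checked with a few lines of arithmetic. The sketch below uses only figures quoted on this page (3B vs. 7B parameters, CIDEr 131.2 with a 17.6-point margin, 84% token reduction). The LLaVA-1.5 7B CIDEr value is derived here by subtraction and is not itself stated in the text, and the 1000-token input is a hypothetical number chosen only to make the percentage concrete.

```python
# Figures quoted in the abstract and findings.
tiny_params = 3e9        # TinyGPT-STTF parameter count
llava_params = 7e9       # LLaVA-1.5 7B parameter count
tiny_cider = 131.2       # TinyGPT-STTF CIDEr on COCO 2017
cider_margin = 17.6      # reported margin over LLaVA-1.5 7B
token_reduction = 0.84   # reported token reduction on event-based tasks

# Parameter ratio: consistent with the reported "2.3x fewer parameters".
param_ratio = llava_params / tiny_params

# Baseline CIDEr implied by the reported margin (derived, not quoted).
implied_baseline_cider = round(tiny_cider - cider_margin, 1)

# An 84% reduction keeps 16% of the tokens.
tokens_before = 1000     # hypothetical input size, for illustration only
tokens_after = round(tokens_before * (1 - token_reduction))

print(f"parameter ratio: {param_ratio:.2f}x")
print(f"implied baseline CIDEr: {implied_baseline_cider}")
print(f"tokens kept out of {tokens_before}: {tokens_after}")
```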
Abstract
The demand for edge AI in vision-language tasks requires models that achieve real-time performance on resource-constrained devices with limited power and memory. This paper proposes two adaptive compression techniques -- Sparse Temporal Token Fusion (STTF) and Adaptive Neural Compression (ANC) -- that integrate algorithmic innovations with hardware-aware optimizations. Unlike previous approaches relying on static pruning or uniform scaling, STTF dynamically reuses visual tokens through event-driven change detection, while ANC conditionally activates encoder branches via a learned router, enabling fine-grained adaptation to scene complexity. Our 3B-parameter TinyGPT-STTF achieves CIDEr 131.2, BLEU-4 0.38, METEOR 0.31, and ROUGE-L 0.56 on the COCO 2017 test set, surpassing LLaVA-1.5 7B by 17.6 CIDEr points while using 2.3x fewer parameters and 62x fewer on-device FLOPs. TinyGPT-ANC…
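The token reuse idea behind STTF can be sketched as a per-patch cache: a patch is re-encoded only when it has changed enough since its token was last computed, and otherwise the cached token is reused. This is a minimal sketch under stated assumptions, not the authors' method: the `TokenCache` class, the mean-absolute-difference change measure, the threshold value, and the toy encoder are all hypothetical choices made for this example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Patch = Sequence[float]
Token = List[float]


def mean_abs_diff(a: Patch, b: Patch) -> float:
    """Simple change measure between two patches of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


class TokenCache:
    """Hypothetical cache that re-encodes only the patches that changed."""

    def __init__(self, encode_patch: Callable[[Patch], Token], threshold: float):
        self.encode_patch = encode_patch  # stand-in for a visual encoder
        self.threshold = threshold        # assumed change threshold
        self.ref_patches: Optional[List[List[float]]] = None
        self.tokens: List[Token] = []

    def update(self, patches: Sequence[Patch]) -> Tuple[List[Token], int]:
        """Return the tokens for this frame and how many were recomputed."""
        if self.ref_patches is None or len(self.ref_patches) != len(patches):
            # First frame (or layout change): encode everything.
            self.ref_patches = [list(p) for p in patches]
            self.tokens = [self.encode_patch(p) for p in patches]
            return list(self.tokens), len(patches)
        recomputed = 0
        for i, patch in enumerate(patches):
            # Compare against the patch the cached token was computed from,
            # so slow drift eventually triggers a refresh.
            if mean_abs_diff(patch, self.ref_patches[i]) > self.threshold:
                self.ref_patches[i] = list(patch)
                self.tokens[i] = self.encode_patch(patch)
                recomputed += 1
        return list(self.tokens), recomputed


def toy_encoder(patch: Patch) -> Token:
    # Placeholder "encoder": one-dimensional token holding the patch mean.
    return [sum(patch) / len(patch)]


cache = TokenCache(toy_encoder, threshold=0.1)
frame0 = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0], [0.2, 0.2]]
frame1 = [[0.0, 0.0], [0.5, 0.5], [0.0, 0.0], [0.2, 0.25]]

_, n0 = cache.update(frame0)        # every patch is encoded on the first frame
tokens1, n1 = cache.update(frame1)  # only the patch that changed is re-encoded
print(n0, n1)
```

In this toy run the second frame re-encodes one patch out of four; the slightly perturbed last patch stays below the threshold and keeps its cached token. The fraction of reused tokens in a real system would depend on the scene dynamics and on how the change detector is tuned.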
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
