TL;DR
CoreMatching is a co-adaptive sparse inference framework that exploits the synergy between token pruning and neuron pruning to accelerate vision-language models (VLMs), outperforming state-of-the-art baselines in inference efficiency.
Contribution
This work uncovers the interplay between token and neuron sparsity in VLMs and proposes a novel framework leveraging their synergy for improved inference efficiency.
Findings
Achieved a 5x FLOPs reduction on the NVIDIA Titan Xp
Realized a 10x overall inference speedup
Surpassed state-of-the-art baselines on multiple image understanding tasks
Abstract
Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: Do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we find that key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we…
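To make the two sparsity paradigms in the abstract concrete, here is a minimal sketch of how token pruning and neuron pruning compound in a feed-forward block. It is not the CoreMatching algorithm: a generic top-k selection stands in for the paper's Core Token and Core Neuron matching, and every name, score, and size below (`top_k_indices`, `ffn_flops`, `token_scores`, `neuron_acts`, `d_model`) is a hypothetical chosen for illustration.

```python
import math


def top_k_indices(scores, keep_ratio):
    """Return indices of the highest-scoring entries, in original order."""
    k = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def ffn_flops(num_tokens, d_model, d_hidden):
    """Approximate FLOPs of a two-layer feed-forward block."""
    # The up projection and the down projection are each one matmul over
    # num_tokens x d_model x d_hidden, counted as 2 FLOPs per multiply-add.
    return 2 * (2 * num_tokens * d_model * d_hidden)


# Hypothetical importance scores (assumed values, not from the paper).
token_scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.3, 0.7, 0.2]  # one per token
neuron_acts = [0.0, 2.1, 0.3, 0.0, 1.5, 0.0, 0.9, 0.1]    # one per hidden neuron

# Token sparsity: keep the half of the tokens with the highest scores.
kept_tokens = top_k_indices(token_scores, keep_ratio=0.5)
# Neuron sparsity: keep the quarter of neurons with the largest activations.
kept_neurons = top_k_indices([abs(a) for a in neuron_acts], keep_ratio=0.25)

d_model = 16  # assumed embedding width
dense = ffn_flops(len(token_scores), d_model, len(neuron_acts))
sparse = ffn_flops(len(kept_tokens), d_model, len(kept_neurons))
reduction = dense / sparse

print(kept_tokens, kept_neurons, reduction)
```

Because feed-forward cost scales with the product of token count and neuron count, keeping a fraction a of the tokens and a fraction b of the neurons cuts FLOPs by roughly 1/(a·b) in this toy model. The sketch selects tokens and neurons independently, which is exactly the assumption the paper questions.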
