TL;DR
Fourier Compressor is a parameter-free, frequency-domain module that compresses visual tokens in vision-language models, cutting computational cost while largely preserving accuracy on both image and video inputs.
Contribution
It introduces a frequency-domain visual token compression method that outperforms existing parameter-free approaches and generalizes across multiple architectures and tasks.
Findings
Retains over 96% of original accuracy with up to 83.8% FLOPs reduction.
Boosts generation speed by 31.2%.
Outperforms existing parameter-free methods and surpasses some parameterized approaches.
Abstract
Vision-Language Models (VLMs) incur substantial computational overhead and inference latency due to the large number of vision tokens introduced by high-resolution image and video inputs. Existing parameter-free token compression methods typically rely on token selection or merging, yet they risk discarding substantial visual information or distorting the original representation distribution, resulting in pronounced performance degradation at high compression ratios. In response, we aim to explore a more effective and efficient visual token compression strategy, with a promising direction in the frequency domain. Motivated by the success of frequency-domain transforms in image compression (e.g., JPEG), we systematically analyze the frequency redundancy in visual representations and uncover a non-uniform distribution of semantic information across frequency bands. Building upon this, we…
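The abstract is truncated before it describes the method, so the following is only a minimal sketch of the general idea it points to: JPEG-style low-pass compression applied to a grid of vision tokens. It is not the authors' implementation. The function name `lowpass_compress`, the 24×24 input grid, and the 12×12 output grid are assumptions chosen for illustration.

```python
import numpy as np


def lowpass_compress(tokens: np.ndarray, grid: int, keep: int) -> np.ndarray:
    """Hypothetical helper: shrink a (grid*grid, dim) token map to (keep*keep, dim).

    Assumption: tokens lie on a square spatial grid, and keeping only the
    lowest spatial frequencies is an acceptable approximation.
    """
    n, dim = tokens.shape
    if n != grid * grid or not (0 < keep <= grid):
        raise ValueError("expected grid*grid tokens and 0 < keep <= grid")

    x = tokens.reshape(grid, grid, dim)

    # 2D FFT over the two spatial axes; fftshift moves the zero-frequency
    # (DC) component to index grid // 2 on each axis.
    spec = np.fft.fftshift(np.fft.fft2(x, axes=(0, 1)), axes=(0, 1))

    # Crop a keep x keep window around DC so that DC lands at keep // 2,
    # which is where ifftshift expects it.
    start = grid // 2 - keep // 2
    crop = spec[start:start + keep, start:start + keep, :]

    # Inverse transform on the smaller grid. The scale factor compensates
    # for the different normalization of the two transform sizes.
    small = np.fft.ifft2(np.fft.ifftshift(crop, axes=(0, 1)), axes=(0, 1))
    small = small * (keep * keep) / (grid * grid)

    # The cropped spectrum is not exactly conjugate-symmetric for even
    # `keep`, so a small imaginary residue is discarded here.
    return small.real.reshape(keep * keep, dim)


# Usage: 576 tokens (24 x 24) reduced to 144 tokens (12 x 12).
rng = np.random.default_rng(0)
vision_tokens = rng.standard_normal((576, 8))
compressed = lowpass_compress(vision_tokens, grid=24, keep=12)
print(compressed.shape)
```

Because the operation is a fixed transform followed by a crop, it adds no learnable parameters, which is the property the summary highlights. Whether the paper crops, masks, or weights frequency bands differently cannot be determined from the truncated abstract.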
Code & Models
- 🤗 whyisverysmart/Fourier-LLaVA-v1.5-13B-144 (model, 15 downloads)
- 🤗 whyisverysmart/Fourier-LLaVA-v1.5-7B-256 (model, 44 downloads)
- 🤗 whyisverysmart/Fourier-LLaVA-v1.5-7B-144 (model, 9 downloads)
- 🤗 whyisverysmart/Fourier-LLaVA-v1.5-7B-64 (model, 14 downloads)
- 🤗 whyisverysmart/Fourier-LLaVA-v1.5-7B-36 (model, 14 downloads)
- 🤗 whyisverysmart/Fourier-Qwen2-VL-2B-0.67 (model, 47 downloads)
- 🤗 whyisverysmart/Fourier-Qwen2.5-VL-3B-0.67 (model, 33 downloads)
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
