From Pixels and Words to Waves: A Unified Framework for Spectral Dictionary vLLMs
Andrew Kiruluta, Priscilla Burity

TL;DR
This paper introduces SDict-VLM, a vision-language model that replaces convolutions and self-attention with a spectral dictionary token mixer. It reports competitive captioning and VQA performance with fewer parameters, lower peak GPU memory, faster inference, and improved interpretability.
Contribution
The work presents what is, to the authors' knowledge, the first VLM that eliminates both convolutions and self-attention, using a spectral dictionary token mixer for efficient and interpretable multimodal learning.
Findings
Achieves BLEU-4 of 39.2, CIDEr of 127.5, and SPICE of 27.0 on MS-COCO captioning
Reaches 50.3% accuracy on VQAv2
Uses 60% fewer parameters and 2.3x less peak GPU memory, and runs inference 2.2x faster than PaLI-3
Abstract
Vision-language models (VLMs) unify computer vision and natural language processing in a single architecture capable of interpreting and describing images. Most state-of-the-art systems rely on two computationally intensive components: convolutions in the vision encoder and quadratic self-attention for multimodal fusion. This work removes both by introducing a spectral dictionary token mixer, which represents each image patch or wordpiece as a sparse combination of learnable frequency atoms. Our 1.1B-parameter prototype, SDict-VLM, achieves BLEU-4 of 39.2, CIDEr of 127.5, and SPICE of 27.0 on MS-COCO captioning, along with 50.3 percent accuracy on VQAv2. These results close approximately 85 percent of the performance gap to BLIP-2 while using 60 percent fewer parameters, 2.3 times less peak GPU memory, and 2.2 times faster inference than PaLI-3. To our knowledge, this is the first VLM…
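The abstract describes each patch or wordpiece as a sparse combination of learnable frequency atoms. The NumPy sketch below illustrates that general idea only and is not the authors' implementation. Everything in it is an assumption made for illustration: the cosine form of the atoms, the helper names (`make_atoms`, `sparse_code`, `reconstruct`), the fixed frequencies and phases standing in for learned parameters, and the top-k rule used to enforce sparsity.

```python
import numpy as np

def make_atoms(freqs, phases, dim):
    """Build unit-norm cosine atoms of length `dim` (assumed atom form)."""
    t = np.arange(dim) / dim  # sample positions in [0, 1)
    atoms = np.cos(2 * np.pi * freqs[:, None] * t[None, :] + phases[:, None])
    return atoms / np.linalg.norm(atoms, axis=1, keepdims=True)

def sparse_code(tokens, atoms, k):
    """Project tokens onto the atoms and keep the k largest-magnitude coefficients per token."""
    coeffs = tokens @ atoms.T  # shape: (n_tokens, n_atoms)
    # Column indices of all but the k largest |coefficients| in each row.
    drop = np.argsort(np.abs(coeffs), axis=1)[:, :-k]
    sparse = coeffs.copy()
    np.put_along_axis(sparse, drop, 0.0, axis=1)
    return sparse

def reconstruct(sparse, atoms):
    """Rebuild token embeddings as a sparse combination of atoms."""
    return sparse @ atoms

# Hypothetical sizes; in a trained model freqs and phases would be learned.
rng = np.random.default_rng(0)
dim, n_atoms, k = 64, 16, 4
freqs = np.arange(1, n_atoms + 1, dtype=float)
phases = np.zeros(n_atoms)

atoms = make_atoms(freqs, phases, dim)
tokens = rng.normal(size=(5, dim))  # stand-in for patch or wordpiece embeddings
codes = sparse_code(tokens, atoms, k)
approx = reconstruct(codes, atoms)
```

Each row of `codes` has at most `k` nonzero entries, so the active atoms for a token can be read off directly, which is one plausible reading of the interpretability claim. The projection costs time linear in the number of tokens, in contrast to the quadratic cost of self-attention.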
