From Pixels and Words to Waves: A Unified Framework for Spectral Dictionary vLLMs

Andrew Kiruluta; Priscilla Burity

arXiv:2506.18943 · cs.CV · June 25, 2025

TL;DR

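This paper introduces SDict-VLM, a 1.1B-parameter vision-language model that replaces convolutions and self-attention with a spectral dictionary token mixer. It closes approximately 85% of the performance gap to BLIP-2 while using 60% fewer parameters, 2.3x less peak GPU memory, and 2.2x less inference time than PaLI-3, and its sparse frequency-atom representation is presented as more interpretable.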

Contribution

To the authors' knowledge, this is the first VLM to eliminate both convolutions and self-attention, replacing them with a spectral dictionary token mixer for efficient and interpretable multimodal learning.

Findings

1. Achieves BLEU-4 of 39.2, CIDEr of 127.5, and SPICE of 27.0 on MS-COCO captioning.

2. Reaches 50.3% accuracy on VQAv2.

3. Uses 60% fewer parameters and runs inference 2.2x faster than PaLI-3.

Abstract

Vision-language models (VLMs) unify computer vision and natural language processing in a single architecture capable of interpreting and describing images. Most state-of-the-art systems rely on two computationally intensive components: convolutions in the vision encoder and quadratic self-attention for multimodal fusion. This work removes both by introducing a spectral dictionary token mixer, which represents each image patch or wordpiece as a sparse combination of learnable frequency atoms. Our 1.1B-parameter prototype, SDict-VLM, achieves BLEU-4 of 39.2, CIDEr of 127.5, and SPICE of 27.0 on MS-COCO captioning, along with 50.3 percent accuracy on VQAv2. These results close approximately 85 percent of the performance gap to BLIP-2 while using 60 percent fewer parameters, 2.3 times less peak GPU memory, and 2.2 times faster inference than PaLI-3. To our knowledge, this is the first VLM…
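
To make the phrase "sparse combination of learnable frequency atoms" concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`fourier_atoms`, `sparse_codes`, `reconstruct`), the choice of fixed cosine (DCT-II) atoms in place of learned ones, and the top-k selection rule are all assumptions made for illustration.

```python
import numpy as np

def fourier_atoms(num_atoms, dim):
    """Return a (num_atoms, dim) dictionary of unit-norm cosine atoms.

    Assumption: fixed DCT-II rows stand in for the learnable frequency
    atoms; a trained model would update these by gradient descent.
    """
    n = np.arange(dim)
    k = np.arange(num_atoms)[:, None]
    atoms = np.cos(np.pi * (n + 0.5) * k / dim)
    return atoms / np.linalg.norm(atoms, axis=1, keepdims=True)

def sparse_codes(tokens, atoms, k):
    """Project each token onto the atoms and keep its k largest coefficients.

    tokens: (num_tokens, dim) patch or wordpiece embeddings.
    Returns a (num_tokens, num_atoms) array with at most k nonzeros per row.
    """
    coeffs = tokens @ atoms.T
    # Indices of the k largest-magnitude coefficients in each row.
    keep = np.argsort(-np.abs(coeffs), axis=1)[:, :k]
    sparse = np.zeros_like(coeffs)
    np.put_along_axis(sparse, keep, np.take_along_axis(coeffs, keep, axis=1), axis=1)
    return sparse

def reconstruct(codes, atoms):
    """Map sparse codes back to embedding space as a weighted sum of atoms."""
    return codes @ atoms

# Hypothetical usage: 4 tokens of width 16, each described by 4 atoms.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16))
atoms = fourier_atoms(16, 16)
codes = sparse_codes(tokens, atoms, k=4)
approx = reconstruct(codes, atoms)
```

With a complete orthonormal dictionary and `k` equal to the number of atoms, the reconstruction is exact; smaller `k` trades fidelity for sparsity. Each nonzero coefficient points to a single named frequency, which illustrates in a simplified way the kind of interpretability the summary attributes to the approach.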
