CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning
Andrew Kiruluta, Preethi Raju, Priscilla Burity

TL;DR
This paper introduces CSAT, a compressed sensing-based attention mechanism that reduces computational costs in vision-language models by leveraging sparsity and low-dimensional projections, enabling scalable multimodal reasoning.
Contribution
The paper proposes the Compressed Sensing Attention Transformer (CSAT), which reduces attention complexity in vision-language models (vLLMs) while preserving semantic accuracy; the authors present it as especially effective for video and language representations.
Findings
CSAT reduces attention complexity by leveraging sparsity.
Maintains semantic fidelity with lower-dimensional projections.
Improves scalability and efficiency of vision-language models.
Abstract
Vision-Language Models (vLLMs) have emerged as powerful architectures for joint reasoning over visual and textual inputs, enabling breakthroughs in image captioning, cross-modal retrieval, and multimodal dialogue. However, as these models scale to longer video sequences and richer language descriptions, the quadratic complexity of the standard attention mechanism presents a fundamental computational bottleneck. This challenge is exacerbated in vLLMs, where attention must be computed not only within modalities but also across them, leading to prohibitive memory and latency costs. In this work, we introduce the Compressed Sensing Attention Transformer (CSAT), a novel architecture that reimagines attention computation through the lens of compressed sensing. By projecting high-dimensional key and value representations into a lower-dimensional subspace via random measurement matrices and…
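The abstract is cut off before it says how the projected keys and values are used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a Gaussian random measurement matrix applied along the sequence axis and no sparse-recovery step; the names `compressed_attention`, `Phi`, and `m` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row maximum before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compressed_attention(Q, K, V, m, seed=0):
    """Attention over m random measurements of the n keys/values (assumed scheme).

    Q: (n_q, d) queries; K, V: (n, d) keys and values; m: number of measurements.
    """
    n, d = K.shape
    rng = np.random.default_rng(seed)
    # Assumption: i.i.d. Gaussian measurement matrix, scaled by 1/sqrt(m).
    Phi = rng.standard_normal((m, n)) / np.sqrt(m)
    K_c = Phi @ K  # (m, d) compressed keys
    V_c = Phi @ V  # (m, d) compressed values
    scores = Q @ K_c.T / np.sqrt(d)     # (n_q, m) instead of (n_q, n)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V_c, weights

rng = np.random.default_rng(1)
Q = rng.standard_normal((128, 16))  # e.g. text-token queries
K = rng.standard_normal((512, 16))  # e.g. video-token keys
V = rng.standard_normal((512, 16))
out, weights = compressed_attention(Q, K, V, m=64)
print(out.shape, weights.shape)
```

Under these assumptions the score matrix has 128 x 64 entries rather than 128 x 512, so the attention cost scales with the number of measurements `m` instead of the full key sequence length.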
