CASA: Cross-Attention over Self-Attention for Efficient Vision-Language Fusion
Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez

TL;DR
This paper investigates cross-attention mechanisms in vision-language models and shows that they are more efficient than token insertion methods while remaining competitive with them, especially for real-time video captioning.
Contribution
The paper provides a detailed analysis of cross-attention versus self-attention, trains effective cross-attention VLMs, and showcases their advantages in real-time video captioning.
Findings
Cross-attention models are more efficient than token insertion in VLMs.
Simple cross-attention approaches can match or outperform token insertion methods.
Cross-attention enables low-latency, memory-efficient real-time video captioning.
Abstract
Vision-language models (VLMs) are commonly trained by directly inserting image tokens from a pretrained vision encoder into the text stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes rapidly costly for long multi-image conversations or streaming video applications, both in terms of memory and compute. VLMs leveraging cross-attention (CA) are an efficient alternative to token insertion as image tokens are not added to the KV cache. Despite being introduced early on, multimodal CA models are scarce in the current VLM literature and often underperform their token insertion counterparts. In this work, we reinvestigate the effectiveness of cross-attention for vision-language modeling: (i) We analyze the core differences between the cross-attention and self-attention mechanisms, (ii) we train cross-attention VLMs…
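The memory argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It counts only the positions stored in the language model's self-attention KV cache over a multi-image conversation. The function names and all sizes (tokens per turn, tokens per image, layer and head counts) are hypothetical values chosen for illustration and are not taken from the paper. The sketch also ignores whatever keys and values the cross-attention layers themselves need for the current image.

```python
def kv_cache_tokens(num_turns, text_tokens_per_turn, image_tokens_per_image,
                    insert_images):
    """Token positions held in the self-attention KV cache after `num_turns`
    turns, assuming one image per turn (an illustrative assumption)."""
    per_turn = text_tokens_per_turn
    if insert_images:
        # Token insertion: image tokens join the text stream and are cached.
        per_turn += image_tokens_per_image
    # Cross-attention: image tokens are not added to the KV cache.
    return num_turns * per_turn


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Approximate cache size: one key and one value vector per token,
    per layer, per KV head."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, not the paper's setup.
TURNS, TEXT_PER_TURN, IMAGE_TOKENS = 100, 32, 256
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

inserted = kv_cache_tokens(TURNS, TEXT_PER_TURN, IMAGE_TOKENS, insert_images=True)
cross = kv_cache_tokens(TURNS, TEXT_PER_TURN, IMAGE_TOKENS, insert_images=False)

for name, tokens in [("token insertion", inserted), ("cross-attention", cross)]:
    size = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{name}: {tokens} cached tokens, ~{size / 2**20:.0f} MiB")
```

Under these assumed sizes the gap grows linearly with the number of images, which is why the difference matters most for long multi-image conversations and streaming video.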
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques
