Differential Multimodal Transformers

Jerry Li; Timothy Oh; Joseph Hoang; Vardhit Veeramachaneni

arXiv:2507.15875 · cs.AI · July 23, 2025

Open Access

TL;DR

This paper extends Differential Attention to a multimodal transformer, PaliGemma, and fine-tunes it to better filter noisy information and reduce hallucinations in vision-language tasks.

Contribution

The paper adapts Differential Attention to text-vision models and demonstrates that it improves information retrieval and question answering.

Findings

01. Differential Attention improves the filtering of noisy information.

02. Fine-tuning with Differential Attention reduces hallucinations.

03. Question-answering performance improves with Differential Attention.

Abstract

Small language models have gained significant popularity due to their efficiency and growing capabilities. However, incorporating additional modalities, such as vision, can exacerbate the challenge of limited context windows by introducing noise. Recent studies have highlighted that Transformer attention mechanisms often disproportionately focus on irrelevant contexts. In this work, we extend the Differential Attention mechanism, originally designed for text-only models, to the text-vision model PaliGemma. Our aim is to evaluate its ability to mitigate noisy information retrieval and reduce hallucinations. To this end, we fine-tuned the PaliGemma 3B model using LoRA, incorporating Differential Attention, and experimented with various parameter settings and configurations. We demonstrate that Differential Attention can be adapted and integrated into the fine-tuning of existing models to…
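
For readers unfamiliar with the mechanism the abstract refers to, the following is a minimal sketch of single-head Differential Attention as it is usually formulated for text-only models: two softmax attention maps are computed from two separate query/key projections, and the second map, scaled by a factor λ, is subtracted from the first before weighting the values. This is an illustration only, not the authors' PaliGemma or LoRA implementation. The function names (`softmax`, `attention_map`, `differential_attention`) and the plain scalar `lam` are assumptions made for this sketch; the learnable reparameterization of λ, the per-head normalization, and the multi-head layout are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(Q, K):
    """Row-wise softmax(Q K^T / sqrt(d)) for lists of d-dimensional vectors."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K])
        for q in Q
    ]


def differential_attention(Q1, K1, Q2, K2, V, lam):
    """Sketch of (softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V.

    `lam` is a fixed scalar here (an assumption of this sketch); in a trained
    model it would be a learned quantity.
    """
    A1 = attention_map(Q1, K1)
    A2 = attention_map(Q2, K2)
    out = []
    for row1, row2 in zip(A1, A2):
        # Subtracting the second map cancels attention mass that both maps
        # assign to the same (presumably irrelevant) positions.
        weights = [a - lam * b for a, b in zip(row1, row2)]
        out.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return out
```

With `lam = 0.0` the function reduces to ordinary scaled dot-product attention, and when both projections coincide and `lam = 1.0` the two maps cancel exactly. That cancellation is the intuition behind using the difference to suppress attention that both maps place on the same positions.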

Taxonomy

Topics: Speech and dialogue systems