LaMI: Augmenting Large Language Models via Late Multi-Image Fusion

Guy Yariv; Idan Schwartz; Yossi Adi; Sagie Benaim

arXiv:2406.13621 · cs.CL · April 14, 2026

TL;DR

LaMI introduces a late multi-image fusion technique that enhances large language models with visual signals at test time, improving visual reasoning and NLP performance without extensive retraining.

Contribution

The paper presents a novel late-fusion approach that combines multiple generated images with textual predictions, outperforming prior methods on visual and textual benchmarks.

Findings

01 Outperforms previous augmented LLMs on visual reasoning tasks.

02 Matches vision-language models on vision-based tasks.

03 Improves NLP performance with modest test-time overhead.

Abstract

Commonsense reasoning often requires both textual and visual knowledge, yet Large Language Models (LLMs) trained solely on text lack visual grounding (e.g., "what color is an emperor penguin's belly?"). Visual Language Models (VLMs) perform better on visually grounded tasks but face two limitations: (i) often reduced performance on text-only commonsense reasoning compared to text-trained LLMs, and (ii) adapting newly released LLMs to vision input typically requires costly multimodal training. An alternative augments LLMs with test-time visual signals, improving visual commonsense without harming textual reasoning, but prior designs often rely on early fusion and a single image, which can be suboptimal. We propose a late multi-image fusion method: multiple images are generated from the text prompt with a lightweight parallel sampling, and their prediction probabilities are combined with…
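
The late-fusion idea described above can be illustrated with a minimal sketch. It assumes that the model produces one probability distribution over the same answer candidates for the text-only prompt and one for each generated image, and that these are combined at the probability level. The function name `late_fuse`, the mixing weight `alpha`, and the plain averaging over images are hypothetical choices made for illustration; the abstract is truncated before it states how the probabilities are actually combined, so this should not be read as the authors' formula.

```python
from typing import List, Sequence


def late_fuse(
    text_probs: Sequence[float],
    image_probs: Sequence[Sequence[float]],
    alpha: float = 0.5,
) -> List[float]:
    """Mix a text-only distribution with the mean of per-image distributions.

    Illustrative only: `alpha` and the simple mean are assumptions,
    not the weighting scheme used in the paper.
    """
    if not image_probs:
        raise ValueError("need at least one image-conditioned distribution")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n = len(text_probs)
    if any(len(p) != n for p in image_probs):
        raise ValueError("all distributions must cover the same candidates")

    k = len(image_probs)
    # Average the k image-conditioned distributions, candidate by candidate.
    visual_mean = [sum(p[i] for p in image_probs) / k for i in range(n)]

    # Late fusion: combine at the probability level, after each prediction is made.
    fused = [(1.0 - alpha) * t + alpha * v for t, v in zip(text_probs, visual_mean)]

    # Renormalize so the result is a valid distribution even with rounding drift.
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused probabilities must have positive mass")
    return [f / total for f in fused]


# Hypothetical candidates ["black", "white"] for the penguin-belly question.
text_only = [0.6, 0.4]                  # text-only prediction
per_image = [[0.2, 0.8], [0.4, 0.6]]    # one prediction per generated image
fused = late_fuse(text_only, per_image, alpha=0.5)
print(fused)
```

Setting `alpha=0` in this sketch recovers the text-only prediction, which mirrors the stated goal of adding visual signals without harming textual reasoning; a real system would need some way to choose that weight, which is not covered here.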
