TL;DR
LaMI introduces a late multi-image fusion technique that enhances large language models with visual signals at test time, improving visual reasoning and NLP performance without extensive retraining.
Contribution
The paper presents a novel late-fusion approach that combines multiple generated images with textual predictions, outperforming prior methods on visual and textual benchmarks.
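To make "late fusion" concrete, here is a minimal sketch of one way per-image predictions could be merged with a text-only prediction at the output level. It is an illustration under stated assumptions, not the paper's actual rule: the function name `late_fuse`, the uniform average over images, and the mixing weight `alpha` are all hypothetical.

```python
from typing import List, Sequence


def late_fuse(
    text_probs: Sequence[float],
    image_probs: Sequence[Sequence[float]],
    alpha: float = 0.5,
) -> List[float]:
    """Mix a text-only distribution with the mean of per-image distributions.

    ASSUMPTION: `alpha` is a hypothetical mixing weight and the uniform
    average over images is a placeholder; the paper may combine them differently.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if not image_probs:
        # No visual signal available: fall back to the text-only prediction.
        return list(text_probs)
    n = len(text_probs)
    if any(len(p) != n for p in image_probs):
        raise ValueError("all distributions must cover the same candidates")

    # Average the distributions obtained by conditioning on each generated image.
    k = len(image_probs)
    visual = [sum(p[i] for p in image_probs) / k for i in range(n)]

    # Late fusion: combine at the probability level, then renormalize.
    fused = [(1.0 - alpha) * t + alpha * v for t, v in zip(text_probs, visual)]
    total = sum(fused)
    return [f / total for f in fused]


# Two answer candidates, two generated images.
fused = late_fuse([0.6, 0.4], [[0.2, 0.8], [0.4, 0.6]], alpha=0.5)
```

Because the combination happens on output probabilities, setting `alpha` to 0 in this sketch recovers the text-only prediction unchanged, which is the property that lets a late-fusion design leave textual reasoning intact.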
Findings
Outperforms previous augmented LLMs on visual reasoning tasks.
Matches vision-language models on vision-based tasks.
Improves NLP performance with modest test-time overhead.
Abstract
Commonsense reasoning often requires both textual and visual knowledge, yet Large Language Models (LLMs) trained solely on text lack visual grounding (e.g., "what color is an emperor penguin's belly?"). Visual Language Models (VLMs) perform better on visually grounded tasks but face two limitations: (i) often reduced performance on text-only commonsense reasoning compared to text-trained LLMs, and (ii) adapting newly released LLMs to vision input typically requires costly multimodal training. An alternative augments LLMs with test-time visual signals, improving visual commonsense without harming textual reasoning, but prior designs often rely on early fusion and a single image, which can be suboptimal. We propose a late multi-image fusion method: multiple images are generated from the text prompt with lightweight parallel sampling, and their prediction probabilities are combined with…
