Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?
Tobias Norlund, Lovisa Hagström, Richard Johansson

TL;DR
This paper proposes a new evaluation method and architecture to measure and enhance visual knowledge transfer in large language models, aiming to reduce hallucinations and improve factual accuracy.
Contribution
It introduces a novel task and filtering method to evaluate visual knowledge transfer, along with a new model architecture incorporating visual imagination.
Findings
The evaluation method effectively measures visual knowledge transfer.
The proposed architecture shows promising results in leveraging multimodal knowledge.
Models with visual imagination outperform baseline models in knowledge transfer tasks.
Abstract
Large language models are known to suffer from the hallucination problem in that they are prone to output statements that are false or inconsistent, indicating a lack of knowledge. A proposed solution to this is to provide the model with additional data modalities that complement the knowledge obtained through text. We investigate the use of visual data to complement the knowledge of large language models by proposing a method for evaluating visual knowledge transfer to text for uni- or multimodal language models. The method is based on two steps: 1) a novel task querying for knowledge of memory colors, i.e., typical colors of well-known objects, and 2) filtering of model training data to clearly separate knowledge contributions. Additionally, we introduce a model architecture that involves a visual imagination step and evaluate it with our proposed method. We find that our method can…
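The two evaluation steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not specify the filtering rule, the color vocabulary, or the scoring metric, so the sketch assumes that filtering means dropping training texts that mention a color word, and that the memory color task is scored by exact-match accuracy. The names `COLOR_WORDS`, `filter_training_text`, and `memory_color_accuracy` are hypothetical and do not come from the paper.

```python
import re

# Hypothetical color vocabulary (an assumption; the paper's label set is not given here).
# Note that some words, such as "orange", are ambiguous between a color and an object.
COLOR_WORDS = {
    "red", "green", "blue", "yellow", "orange", "white",
    "black", "brown", "pink", "purple", "grey", "gray",
}


def mentions_color(text):
    """Return True if any token in the text is a known color word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in COLOR_WORDS for token in tokens)


def filter_training_text(texts):
    """Assumed filtering step: keep only texts that never name a color,
    so color knowledge cannot be picked up from the text alone."""
    return [text for text in texts if not mentions_color(text)]


def memory_color_accuracy(predict, gold):
    """Assumed scoring: exact-match accuracy of predicted colors.

    predict: callable mapping an object name to a predicted color string.
    gold: dict mapping object names to their typical (memory) color.
    """
    correct = sum(1 for obj, color in gold.items() if predict(obj) == color)
    return correct / len(gold)


# Toy data, invented for this example.
captions = [
    "A yellow banana on a table.",
    "A banana on a table.",
    "Two dogs playing in the snow.",
]
filtered = filter_training_text(captions)

gold = {"banana": "yellow", "snow": "white", "grass": "green"}
toy_predictions = {"banana": "yellow", "snow": "white", "grass": "brown"}
accuracy = memory_color_accuracy(lambda obj: toy_predictions.get(obj), gold)
```

In this toy setup, `predict` stands in for whatever model is being evaluated. If a model trained only on the filtered text still answers memory color queries correctly, that knowledge plausibly came from the visual modality rather than from the text.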
