Tackling Vision Language Tasks Through Learning Inner Monologues
Diji Yang, Kezhen Chen, Jinmeng Rao, Xiaoyuan Guo, Yawen Zhang, Jie Yang, Yi Zhang

TL;DR
This paper introduces IMMO, a novel deep learning approach that simulates inner monologue processes to improve reasoning and explanation in vision-language tasks by enabling dynamic interaction between LLMs and VLMs.
Contribution
IMMO is the first method to learn inner monologue processes within deep models, enhancing vision-language reasoning without relying on predefined scripts.
Findings
Improves reasoning and explanation abilities in vision-language tasks.
Learns inner monologue processes within models, avoiding predefined scripts.
Demonstrates effectiveness on popular vision-language benchmarks.
Abstract
Visual language tasks require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are first converted into language descriptions by VLMs, which then serve as inputs for LLMs to generate the final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected into the LLMs' language space via further supervised fine-tuning. The first approach offers low training cost and interpretability but is hard to optimize in an end-to-end fashion. The second approach delivers decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner…
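As a rough illustration of approach (1) described in the abstract, the hybrid pipeline can be sketched as two text-connected stages: a VLM that turns the image into a description, and an LLM that answers from that description. This is a minimal sketch, not the paper's IMMO method. The names `hybrid_answer`, `toy_vlm`, and `toy_llm` and the prompt layout are hypothetical, and the two toy functions are stand-ins for real models.

```python
from typing import Callable

# Hypothetical interfaces (assumptions, not from the paper):
# a captioner maps raw image bytes to text, an answerer maps a prompt to text.
Captioner = Callable[[bytes], str]
Answerer = Callable[[str], str]


def hybrid_answer(image: bytes, question: str, vlm: Captioner, llm: Answerer) -> str:
    """Approach (1): the VLM describes the image, the LLM answers from the text."""
    description = vlm(image)  # visual input -> language description
    prompt = (
        f"Image description: {description}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return llm(prompt)  # the LLM only ever sees text


def toy_vlm(image: bytes) -> str:
    # Stand-in for a real VLM: always returns the same caption.
    return "a red ball on a table"


def toy_llm(prompt: str) -> str:
    # Stand-in for a real LLM: a trivial keyword lookup over the prompt.
    return "red" if "red" in prompt else "unknown"


if __name__ == "__main__":
    print(hybrid_answer(b"", "What color is the ball?", toy_vlm, toy_llm))
```

Because the only link between the two models is the intermediate text, the description can be read and inspected, which is the interpretability the abstract mentions. The same text-only link is also why this setup is hard to optimize end to end.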
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition
