Enhance Reasoning Ability of Visual-Language Models via Large Language Models
Yueting Yang, Xintong Zhang, Wenjuan Han

TL;DR
This paper introduces TReE, a method that enhances the reasoning ability of visual-language models by transferring reasoning skills from large language models through a three-stage process.
Contribution
The paper presents a novel zero-shot approach called TReE that transfers reasoning capabilities from LLMs to VLMs, improving their reasoning performance.
Findings
TReE significantly improves VLM reasoning ability.
The method effectively transfers reasoning skills in zero-shot scenarios.
Enhanced VLMs outperform baseline models in reasoning tasks.
Abstract
Pre-trained visual language models (VLMs) have shown excellent performance in image captioning tasks. However, they sometimes show insufficient reasoning ability. In contrast, large language models (LLMs) exhibit powerful reasoning capabilities. Therefore, we propose a method called TReE, which transfers the reasoning ability of a large language model to a visual language model in zero-shot scenarios. TReE contains three stages: observation, thinking, and re-thinking. In the observation stage, the VLM obtains the overall information of the relevant image. The thinking stage combines the image information and the task description into a prompt for the LLM, which performs inference and produces rationales. In the re-thinking stage, the VLM learns from the rationale and then infers the final result.
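The three stages described in the abstract can be sketched as a short pipeline. This is only an illustrative sketch, not the authors' implementation: the `vlm` and `llm` callables, the `TreeResult` container, the `tree_pipeline` function, and all prompt wordings are hypothetical names and assumptions introduced here, since the abstract does not specify the actual prompts or model interfaces.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model interfaces (assumptions, not taken from the paper):
# a VLM maps (image, text prompt) -> text; an LLM maps text prompt -> text.
VLM = Callable[[object, str], str]
LLM = Callable[[str], str]


@dataclass
class TreeResult:
    observation: str  # stage 1 output: image description from the VLM
    rationale: str    # stage 2 output: reasoning text from the LLM
    answer: str       # stage 3 output: final answer from the VLM


def tree_pipeline(image: object, task: str, vlm: VLM, llm: LLM) -> TreeResult:
    # Stage 1, observation: the VLM describes the image as a whole.
    # The prompt wording is an assumption for illustration.
    observation = vlm(image, "Describe the image in detail.")

    # Stage 2, thinking: the LLM sees only text (description plus task)
    # and produces a rationale.
    thinking_prompt = (
        f"Image description: {observation}\n"
        f"Task: {task}\n"
        "Explain the reasoning step by step."
    )
    rationale = llm(thinking_prompt)

    # Stage 3, re-thinking: the VLM answers with the rationale as context.
    rethinking_prompt = f"Rationale: {rationale}\nTask: {task}\nAnswer:"
    answer = vlm(image, rethinking_prompt)

    return TreeResult(observation, rationale, answer)
```

Because both models are passed in as plain callables, any zero-shot VLM or LLM wrapper could in principle be plugged in without training; the only data flowing between stages is text.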
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
