Enhance Reasoning Ability of Visual-Language Models via Large Language Models

Yueting Yang; Xintong Zhang; Wenjuan Han

arXiv:2305.13267 · cs.CL · May 23, 2023 · 1 citation

Open Access

TL;DR

This paper introduces TReE, a method that enhances the reasoning ability of visual-language models by transferring reasoning skills from large language models through a three-stage process.

Contribution

The paper presents a novel zero-shot approach called TReE that transfers reasoning capabilities from LLMs to VLMs, improving their reasoning performance.

Findings

01. TReE significantly improves VLM reasoning ability.

02. The method effectively transfers reasoning skills in zero-shot scenarios.

03. Enhanced VLMs outperform baseline models in reasoning tasks.

Abstract

Pre-trained visual-language models (VLMs) have shown excellent performance on image captioning tasks. However, they sometimes show insufficient reasoning ability. In contrast, large language models (LLMs) exhibit powerful reasoning capabilities. Therefore, we propose a method called TReE, which transfers the reasoning ability of a large language model to a visual-language model in zero-shot scenarios. TReE contains three stages: observation, thinking, and re-thinking. In the observation stage, the VLM obtains the overall information of the relevant image. In the thinking stage, the image information and the task description are combined into a prompt for the LLM, which infers a rationale. In the re-thinking stage, the VLM learns from the rationale and then infers the final result.
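The three stages described in the abstract can be read as a simple pipeline. The following is a minimal sketch of that control flow only, not the authors' implementation: the function names (`tree_pipeline`, `stub_caption`, `stub_reason`, `stub_answer`), the callable interfaces, and the prompt wording are all assumptions introduced for illustration, since the abstract does not specify models, APIs, or prompt templates.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces; the abstract names no concrete models or APIs.
CaptionFn = Callable[[str], str]        # image path -> textual description
ReasonFn = Callable[[str], str]         # text prompt -> rationale
AnswerFn = Callable[[str, str], str]    # (image path, text prompt) -> answer


@dataclass
class TReEResult:
    observation: str
    rationale: str
    answer: str


def tree_pipeline(image: str, task: str, vlm_caption: CaptionFn,
                  llm_reason: ReasonFn, vlm_answer: AnswerFn) -> TReEResult:
    # Stage 1 (observation): the VLM describes the image.
    observation = vlm_caption(image)

    # Stage 2 (thinking): image information plus task description form the
    # LLM prompt. The template below is an assumed placeholder.
    thinking_prompt = (
        f"Image description: {observation}\n"
        f"Task: {task}\n"
        "Give a rationale."
    )
    rationale = llm_reason(thinking_prompt)

    # Stage 3 (re-thinking): the VLM sees the image again together with the
    # rationale and produces the final result. Template is also assumed.
    rethinking_prompt = f"Task: {task}\nRationale: {rationale}\nAnswer:"
    answer = vlm_answer(image, rethinking_prompt)

    return TReEResult(observation, rationale, answer)


# Stub models so the sketch runs without any real VLM or LLM.
def stub_caption(image: str) -> str:
    return "a person holding an umbrella"


def stub_reason(prompt: str) -> str:
    return "An umbrella suggests rain."


def stub_answer(image: str, prompt: str) -> str:
    return "rainy" if "rain" in prompt else "unknown"


result = tree_pipeline("photo.png", "What is the weather?",
                       stub_caption, stub_reason, stub_answer)
print(result)
```

Passing the models in as plain callables keeps the sketch independent of any particular VLM or LLM library; in practice each stub would be replaced by a wrapper around a real model call.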

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling