Multimodal Chain-of-Thought Reasoning in Language Models
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola

TL;DR
This paper introduces Multimodal-CoT, a framework that combines text and images for reasoning in language models, improving accuracy and efficiency on science and visual question-answering benchmarks.
Contribution
It extends chain-of-thought reasoning to multimodal inputs through a two-stage process that separates rationale generation from answer inference, so the answer stage can draw on rationales grounded in both text and images.
Findings
Achieves state-of-the-art performance on ScienceQA with under 1B parameters.
Mitigates hallucination and improves convergence speed.
Proves effective on the ScienceQA and A-OKVQA multimodal reasoning benchmarks.
Abstract
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating…
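The two-stage framework described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `two_stage_infer`, `TwoStageResult`, and the toy models are hypothetical, the vision input is reduced to a plain list of floats, and the way the rationale is appended to the question is an assumed prompt format.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical interface: any callable mapping (text, image features) to generated text.
Generator = Callable[[str, Sequence[float]], str]


@dataclass
class TwoStageResult:
    rationale: str
    answer: str


def two_stage_infer(
    question: str,
    image_features: Sequence[float],
    rationale_model: Generator,
    answer_model: Generator,
) -> TwoStageResult:
    """Run rationale generation first, then answer inference on the augmented input."""
    # Stage 1: generate a rationale from the text and the vision features.
    rationale = rationale_model(question, image_features)
    # Stage 2: append the rationale to the question (assumed format) and
    # infer the answer, reusing the same vision features.
    augmented = f"{question}\nRationale: {rationale}"
    answer = answer_model(augmented, image_features)
    return TwoStageResult(rationale=rationale, answer=answer)


# Toy stand-ins for trained models, used only to make the sketch runnable.
def toy_rationale_model(text: str, feats: Sequence[float]) -> str:
    brightness = sum(feats) / len(feats)
    return "The image is bright." if brightness > 0.5 else "The image is dark."


def toy_answer_model(text: str, feats: Sequence[float]) -> str:
    return "day" if "bright" in text else "night"


result = two_stage_infer(
    "Is it day or night?", [0.9, 0.8, 0.7], toy_rationale_model, toy_answer_model
)
print(result.rationale)
print(result.answer)
```

Keeping the two stages as separate callables mirrors the separation the abstract describes: the answer stage only sees the question plus whatever rationale the first stage produced, so a better rationale directly changes what the answer stage can use.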
Peer Reviews
Decision · Submitted to ICLR 2024
1. This paper proposed a multimodal CoT reasoning framework by fusing the vision features extracted by ViT with the language features, which can mitigate the challenge of hallucination. 2. This paper separated the CoT reasoning process into two stages: rationale generation and answer inference. 3. This paper conducted extensive experiments and analysis. Experimental results demonstrated the effectiveness of the proposed methods.
1. The performance of Multimodal-CoT falls behind some baselines, e.g. LLaVA (https://arxiv.org/abs/2304.08485) on the ScienceQA dataset and LXMERT (https://aclanthology.org/D19-1514.pdf) on the A-OKVQA dataset. 2. It would be better to add more explanation or motivation for separating the reasoning process into two stages.
- State-of-the-art performance on two benchmarks. - Simple yet effective approach to improving reasoning in vision and language settings. - Comprehensive analysis of the proposed model.
- A few arguments are not convincing or well-supported. For instance, more rigorous experiments are needed to claim *surpassing human performance*: on the one hand, humans can show significant variance when working on the same problem; on the other hand, ScienceQA collects the human performance baseline with Amazon Mechanical Turk, where data quality is quite hard to control. - This paper overclaims on multimodal CoT, since only vision and text are evaluated. Other modalities, such as audio
1. Innovative Approach: The integration of multimodal data (text and images) into CoT reasoning is a significant advancement, addressing a gap in previous research, which focused mainly on the language modality. 2. Mitigation of Hallucination: The approach specifically targets and successfully mitigates the issue of hallucination in answer inference, a common problem in smaller language models. 3. Detailed Analysis: The paper provides a comprehensive background study and analysis of existing CoT tech
1. Limited Scope of Evaluation: The paper evaluates the approach on only two benchmark datasets, ScienceQA and A-OKVQA. While these datasets are relevant and challenging, they represent a specific type of reasoning task. 2. The paper demonstrates the effectiveness of the proposed method primarily in the context of encoder-decoder models. However, its effectiveness in popular left-to-right language models, which are widely used, is not explicitly addressed. This omission can limit t
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
