Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
Jingxuan Wei, Cheng Tan, Zhangyang Gao, Linzhuang Sun, Siyuan Li, Bihui Yu, Ruifeng Guo, Stan Z. Li

TL;DR
This paper introduces COCO-MMR, a challenging open-ended multimodal reasoning dataset based on COCO, and proposes innovative techniques like multi-hop attention and contrastive learning to improve reasoning capabilities in AI models.
Contribution
The paper presents a new open-ended multimodal reasoning dataset and novel techniques to enhance reasoning performance, addressing limitations of previous datasets and approaches.
Findings
The dataset effectively evaluates multimodal reasoning with open-ended questions.
Proposed techniques improve the reasoning accuracy of models on the dataset.
Extensive experiments validate the effectiveness of the dataset and methods.
Abstract
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, which focuses on multimodal scientific questions and explanations from elementary and high school textbooks, lacks a comprehensive evaluation of diverse approaches. To address this gap, we present the COCO Multi-Modal Reasoning (COCO-MMR) dataset, a novel dataset that encompasses an extensive collection of open-ended questions, rationales, and answers derived from the large object dataset COCO. Unlike previous datasets that rely on multiple-choice questions, our dataset pioneers the use of open-ended questions in the context of multimodal CoT, introducing a more challenging problem that effectively assesses the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
