Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024
Jinwoo Ahn, Junhyeok Park, Min-Jun Kim, Kang-Hyeon Kim, So-Yeong Sohn, Yun-Ji Lee, Du-Seong Chang, Yu-Jung Heo, Eun-Sol Kim

TL;DR
This paper presents a multimodal reasoning solution for the SMART-101 challenge, combining detailed image captioning and object detection to enhance language model understanding of complex puzzles for children.
Contribution
It introduces a novel approach that grounds visual cues in detailed text and geometric pattern detection to improve multimodal reasoning in challenging puzzles.
Findings
Achieved 29.5% option selection accuracy on test set
Achieved 27.1% weighted option selection accuracy on challenge set
Demonstrated effectiveness of combining captioning and object detection for reasoning
Abstract
In this paper, the solution of HYU MLLAB KT Team to the Multimodal Algorithmic Reasoning Task: SMART-101 CVPR 2024 Challenge is presented. Beyond conventional visual question-answering problems, the SMART-101 challenge aims to achieve human-level multimodal understanding by tackling complex visio-linguistic puzzles designed for children in the 6-8 age group. To solve this problem, we suggest two main ideas. First, to utilize the reasoning ability of a large-scale language model (LLM), the given visual cues (images) are grounded in the text modality. For this purpose, we generate highly detailed text captions that describe the context of the image and use these captions as input for the LLM. Second, due to the nature of puzzle images, which often contain various geometric visual patterns, we utilize an object detection algorithm to ensure these patterns are not overlooked in the…
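The two ideas in the abstract (grounding the image in a detailed caption, and listing detected geometric patterns so they are not overlooked) can be pictured as a prompt-assembly step before the LLM is called. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: `Detection`, `build_prompt`, and the `min_score` threshold are hypothetical names, and the caption and detections are assumed to come from whatever captioning and object detection models are used upstream.

```python
import string
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object. The fields are illustrative assumptions."""
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    score: float


def build_prompt(caption: str, detections: List[Detection], question: str,
                 options: List[str], min_score: float = 0.5) -> str:
    """Combine caption, detections, question, and options into one text prompt."""
    # Drop low-confidence detections (the threshold is an assumed hyperparameter).
    kept = [d for d in detections if d.score >= min_score]

    lines = ["Image description:", caption, "", "Detected objects:"]
    if kept:
        for d in kept:
            x0, y0, x1, y1 = d.box
            lines.append(f"- {d.label} at ({x0}, {y0}, {x1}, {y1})")
    else:
        lines.append("- none")

    lines += ["", "Question: " + question, "Options:"]
    # Label the answer options A, B, C, ... in order.
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)
```

The returned string would then be passed to a language model of choice. Keeping the detections as a separate list, rather than merging them into the caption, is one simple way to keep geometric patterns visible to the model even when the caption omits them.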
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Logic, programming, and type systems · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · Segment Anything Model
