Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024

Jinwoo Ahn; Junhyeok Park; Min-Jun Kim; Kang-Hyeon Kim; So-Yeong Sohn; Yun-Ji Lee; Du-Seong Chang; Yu-Jung Heo; Eun-Sol Kim

arXiv:2406.05963 · cs.CV · June 11, 2024 · 1 citation

TL;DR

This paper presents a multimodal reasoning solution for the SMART-101 challenge, combining detailed image captioning and object detection to enhance language model understanding of complex puzzles for children.

Contribution

It introduces a novel approach that grounds visual cues in detailed text and geometric pattern detection to improve multimodal reasoning in challenging puzzles.

Findings

01 Achieved 29.5% option selection accuracy on the test set

02 Achieved 27.1% weighted option selection accuracy on the challenge set

03 Demonstrated the effectiveness of combining captioning and object detection for reasoning

Abstract

In this paper, the solution of HYU MLLAB KT Team to the Multimodal Algorithmic Reasoning Task: SMART-101 CVPR 2024 Challenge is presented. Beyond conventional visual question-answering problems, the SMART-101 challenge aims to achieve human-level multimodal understanding by tackling complex visio-linguistic puzzles designed for children in the 6-8 age group. To solve this problem, we suggest two main ideas. First, to utilize the reasoning ability of a large-scale language model (LLM), the given visual cues (images) are grounded in the text modality. For this purpose, we generate highly detailed text captions that describe the context of the image and use these captions as input for the LLM. Second, due to the nature of puzzle images, which often contain various geometric visual patterns, we utilize an object detection algorithm to ensure these patterns are not overlooked in the…
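The two ideas in the abstract, grounding the image in a detailed caption and adding object-detection output so geometric patterns are not lost, can be illustrated with a minimal sketch of how such text might be assembled into a single LLM prompt. This is not the team's implementation: the `Detection` type, the `build_prompt` function, the prompt wording, and the `min_score` threshold are all hypothetical, and the caption and detections are assumed to come from whatever captioning and detection models are in use.

```python
from dataclasses import dataclass
from string import ascii_uppercase
from typing import Sequence, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical container, not from the paper)."""
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    score: float                    # detector confidence in [0, 1]


def build_prompt(
    question: str,
    options: Sequence[str],
    caption: str,
    detections: Sequence[Detection],
    min_score: float = 0.5,  # assumed threshold, chosen for illustration
) -> str:
    """Combine a caption and detector output into a text-only puzzle prompt."""
    # Keep only detections the detector is reasonably confident about.
    kept = [d for d in detections if d.score >= min_score]

    lines = ["Image description:", caption.strip(), "", "Detected objects:"]
    if kept:
        for d in kept:
            x0, y0, x1, y1 = d.box
            lines.append(f"- {d.label} at ({x0}, {y0}, {x1}, {y1})")
    else:
        lines.append("- none")

    # Present the answer options as lettered choices (A, B, C, ...).
    lines += ["", "Question: " + question.strip(), "Options:"]
    for letter, option in zip(ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


if __name__ == "__main__":
    example = build_prompt(
        question="How many circles are there?",
        options=["1", "2", "3", "4", "5"],
        caption="A grid of simple shapes on a white background.",
        detections=[Detection("circle", (10, 20, 30, 40), 0.9)],
    )
    print(example)
```

Listing detections as text alongside the caption gives the LLM explicit object labels and positions that a free-form caption might omit; how the actual solution formats or filters this information is not stated in the portion of the abstract shown here.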

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Logic, programming, and type systems · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Segment Anything Model