Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic Reasoning Task 2023

Xiangyu Wu; Yang Yang; Shengdong Xu; Yifeng Wu; Qingguo Chen; Jianfeng Lu

arXiv:2310.06440·cs.CV·October 11, 2023

Open Access

TL;DR

This paper presents a multi-modal reasoning solution for the SMART-101 Challenge, combining question categorization, object detection, OCR, and adaptive visual feature extraction to address visuolinguistic puzzles for children.

Contribution

The approach introduces a divide-and-conquer method with question type classification, object detection, OCR, and adaptive visual features, tailored for children's visuolinguistic puzzles.

Findings

1. Achieved 26.5% accuracy on the validation set

2. Achieved 24.3% accuracy on the private test set

3. Demonstrated the effectiveness of multi-modal integration

Abstract

In this paper, we present our solution to a Multi-modal Algorithmic Reasoning Task: the SMART-101 Challenge. Unlike traditional visual question-answering datasets, this challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuolinguistic puzzles designed specifically for children in the 6-8 age group. We employed a divide-and-conquer approach. At the data level, inspired by the challenge paper, we categorized all questions into eight types and used the Llama-2-chat model to directly generate the type for each question in a zero-shot manner. Additionally, we trained a YOLOv7 model on the icon45 dataset for object detection and combined it with an OCR method to recognize and locate objects and text within the images. At the model level, we used the BLIP-2 model and added eight adapters to the image encoder ViT-G…
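The zero-shot question-typing step described in the abstract could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code: the eight label names are assumed (the abstract says only "eight types"), the prompt wording is invented, and `generate` is a hypothetical stand-in for whatever call returns a chat model's reply. Using the predicted type's index to pick one of the eight adapters is likewise an assumption about how the two pieces connect.

```python
from typing import Callable, List

# ASSUMPTION: the abstract only says "eight types"; these names are placeholders.
QUESTION_TYPES: List[str] = [
    "counting", "math", "logic", "path",
    "algebra", "measure", "spatial", "pattern",
]


def build_prompt(question: str) -> str:
    """Build a zero-shot prompt asking a chat model to pick exactly one type."""
    options = ", ".join(QUESTION_TYPES)
    return (
        "Classify the puzzle question into exactly one of these types: "
        f"{options}.\nQuestion: {question}\nType:"
    )


def parse_type(reply: str, default: str = "logic") -> str:
    """Return the known type mentioned earliest in the reply, else a fallback."""
    lowered = reply.lower()
    hits = [(lowered.find(t), t) for t in QUESTION_TYPES if t in lowered]
    return min(hits)[1] if hits else default


def type_index(question: str, generate: Callable[[str], str]) -> int:
    """Map a question to an index 0-7 via the model's predicted type.

    `generate` is a hypothetical callable wrapping the chat model.
    """
    return QUESTION_TYPES.index(parse_type(generate(build_prompt(question))))
```

With a stubbed model such as `lambda prompt: "counting"`, `type_index` returns the position of that label in `QUESTION_TYPES`. A free-text reply that names no known type falls back to `default`, which keeps the downstream step from failing on malformed output.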

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition