Integrating Text and Image Pre-training for Multi-modal Algorithmic Reasoning
Zijian Zhang, Wei Liu

TL;DR
This paper introduces a multi-modal neural network that combines text and image pre-trained models with an attention-based fusion layer to improve reasoning in visuo-linguistic puzzles for children, demonstrating superior performance on the SMART-101 dataset.
Contribution
The paper presents a novel multi-modal approach integrating separate pre-trained models with an attention mechanism for reasoning tasks involving visual and textual data.
Findings
Achieved superior performance on the SMART-101 dataset under the puzzle split setting
Validated effectiveness of multi-modal pre-trained representations
Demonstrated benefits of attention-based feature fusion
Abstract
In this paper, we present our solution for the SMART-101 Challenge of the CVPR Multi-modal Algorithmic Reasoning Task 2024. Unlike traditional visual question answering tasks, this challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our model is based on two pre-trained models, dedicated to extracting features from text and images respectively. To integrate the features from the different modalities, we employ a fusion layer with an attention mechanism. We explore different text and image pre-trained models and fine-tune the integrated classifier on the SMART-101 dataset. Experimental results show that under the puzzle split data-splitting setting, our proposed integrated classifier achieves superior performance, verifying the effectiveness of multi-modal pre-trained…
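The abstract does not specify how the attention-based fusion layer is built, so the following is only a minimal sketch of one common design, under stated assumptions: text token features act as queries that attend over image patch features via scaled dot-product attention, and the pooled results are concatenated for a downstream classifier. The function names (`cross_attention`, `mean_pool`, `fuse`) and the toy feature vectors are hypothetical, there are no learned projection weights, and pure Python stands in for the actual pre-trained encoders.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys_values):
    # Each query attends over keys_values, which serve as both keys and values.
    # Assumption: no learned projections; a real layer would apply W_q, W_k, W_v.
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys_values])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(d)]
        )
    return attended

def mean_pool(vectors):
    # Average a list of equal-length vectors into one vector.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def fuse(text_feats, image_feats):
    # Hypothetical fusion: pooled text features concatenated with
    # pooled image features as seen through text-conditioned attention.
    attended = cross_attention(text_feats, image_feats)
    return mean_pool(text_feats) + mean_pool(attended)

# Toy stand-ins for encoder outputs: 2 text tokens, 3 image patches, dim 4.
text_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_feats = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

fused = fuse(text_feats, image_feats)
print(len(fused))  # 8: 4 pooled text dims + 4 attended image dims
```

In a setup like this, the fused vector would feed a classification head that is fine-tuned on the puzzle data, while the choice of which modality supplies the queries remains a design decision the abstract leaves open.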
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Semantic Web and Ontologies
