Integrating Text and Image Pre-training for Multi-modal Algorithmic Reasoning

Zijian Zhang; Wei Liu

arXiv:2406.05318·cs.CV·June 11, 2024·1 cites

Open Access

TL;DR

This paper introduces a multi-modal neural network that combines text and image pre-trained models with an attention-based fusion layer to improve reasoning in visuo-linguistic puzzles for children, demonstrating superior performance on the SMART-101 dataset.

Contribution

The paper presents a novel multi-modal approach integrating separate pre-trained models with an attention mechanism for reasoning tasks involving visual and textual data.
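The kind of attention-based fusion described here can be illustrated with a minimal sketch. The listing does not give the paper's actual fusion architecture, so this is an assumption: plain scaled dot-product attention with no learned projections, in which one text feature vector acts as the query over a set of image patch features. The name `attention_fuse` is hypothetical, and random vectors stand in for the outputs of the pre-trained text and image encoders.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(text_feat, image_feats):
    """Hypothetical fusion step (not the authors' implementation).

    The text vector is the query; the image patch vectors serve as both
    keys and values. Returns the text vector concatenated with the
    attention-weighted image summary, plus the attention weights.
    """
    dim = len(text_feat)
    scores = [dot(text_feat, patch) / math.sqrt(dim) for patch in image_feats]
    weights = softmax(scores)
    attended = [
        sum(w * patch[i] for w, patch in zip(weights, image_feats))
        for i in range(dim)
    ]
    return text_feat + attended, weights


# Stand-in features: in practice these would come from pre-trained encoders.
rng = random.Random(0)
text_feat = [rng.gauss(0, 1) for _ in range(8)]
image_feats = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]

fused, weights = attention_fuse(text_feat, image_feats)
```

In a full model the fused vector would feed a classifier head, and the query, key, and value projections would normally be learned during fine-tuning rather than omitted as they are here.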

Findings

01. Achieved superior performance on the SMART-101 dataset

02. Validated the effectiveness of multi-modal pre-trained representations

03. Demonstrated the benefits of attention-based feature fusion

Abstract

In this paper, we present our solution for the SMART-101 Challenge of the CVPR Multi-modal Algorithmic Reasoning Task 2024. Unlike traditional visual question answering tasks, this challenge evaluates the abstraction, deduction and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our model is based on two pre-trained models, dedicated to extracting features from text and images respectively. To integrate the features from the two modalities, we employ a fusion layer with an attention mechanism. We explore different text and image pre-trained models, and fine-tune the integrated classifier on the SMART-101 dataset. Experimental results show that under the puzzle split setting, our proposed integrated classifier achieves superior performance, verifying the effectiveness of multi-modal pre-trained…
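To make the "puzzle split" setting concrete, the sketch below assumes, as the name suggests, that whole puzzles are held out so that no test puzzle appears in training. The function `puzzle_split`, the `puzzle_id` field, and the 20% test fraction are hypothetical choices for illustration, not the challenge's actual split sizes.

```python
import random


def puzzle_split(instances, test_fraction=0.2, seed=0):
    """Hold out entire puzzles (hypothetical helper, illustrative only).

    Every instance of a held-out puzzle goes to the test set, so the model
    is evaluated on puzzle types it never saw during training.
    """
    puzzle_ids = sorted({inst["puzzle_id"] for inst in instances})
    rng = random.Random(seed)
    rng.shuffle(puzzle_ids)
    n_test = max(1, round(len(puzzle_ids) * test_fraction))
    test_ids = set(puzzle_ids[:n_test])
    train = [inst for inst in instances if inst["puzzle_id"] not in test_ids]
    test = [inst for inst in instances if inst["puzzle_id"] in test_ids]
    return train, test


# Toy data: 10 puzzles with 5 generated instances each.
instances = [
    {"puzzle_id": p, "instance_id": k} for p in range(10) for k in range(5)
]
train, test = puzzle_split(instances)
```

Splitting at the puzzle level rather than the instance level is what makes this a test of generalization: an instance-level split would let the model see other instances of the same puzzle during training.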

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Semantic Web and Ontologies