MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering
Jingwei Peng, Jiehao Chen, Mateo Alejandro Rojas, Meilin Zhang

TL;DR
MV-CoRe is a novel multimodal reasoning model that integrates diverse visual and linguistic features through a transformer to improve complex visual question answering performance.
Contribution
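MV-CoRe introduces a deep fusion approach that combines global embeddings from pre-trained vision and language models with fine-grained semantic-aware visual features (object detection features and scene graph representations), using a Multimodal Fusion Transformer for enhanced reasoning.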
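To make the fusion idea concrete, here is a minimal sketch of how several feature sets could be concatenated into one token sequence and mixed with a single self-attention step. It is an illustration under stated assumptions, not the authors' implementation: the names `self_attention` and `fuse` are hypothetical, all features are assumed to be already projected to a shared dimension, the query/key/value projections are replaced by the identity, and the toy vectors are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections stand in for learned Q/K/V matrices.
    Returns the attended tokens and the attention weight matrix.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return outputs, weights

def fuse(global_visual, question, objects, scene_graph):
    """Hypothetical fusion step: concatenate every feature set into one
    sequence so each token can attend to tokens from all other sources,
    then mean-pool the result into a single fused vector."""
    tokens = [global_visual] + question + objects + scene_graph
    fused, weights = self_attention(tokens)
    d = len(fused[0])
    pooled = [sum(t[c] for t in fused) / len(fused) for c in range(d)]
    return pooled, weights

# Toy 4-dimensional features (made-up values, for illustration only).
global_visual = [0.5, 0.1, 0.0, 0.2]                        # one global image embedding
question = [[0.1, 0.9, 0.0, 0.0], [0.0, 0.8, 0.1, 0.0]]     # question token embeddings
objects = [[0.7, 0.0, 0.3, 0.0], [0.6, 0.1, 0.2, 0.1]]      # detected-object features
scene_graph = [[0.2, 0.2, 0.6, 0.0]]                        # one relation embedding

pooled, weights = fuse(global_visual, question, objects, scene_graph)
print(len(pooled), len(weights))
```

In a real model the pooled vector would feed an answer head, and the transformer would use learned projections, multiple heads, and several layers; the sketch only shows why placing all sources in one sequence lets attention relate question tokens to objects and relations directly.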
Findings
MV-CoRe achieves 77.5% accuracy on the GQA benchmark.
It outperforms existing LVLM baselines.
Ablation studies confirm the importance of the object and scene graph features.
Abstract
Complex Visual Question Answering (Complex VQA) tasks, which demand sophisticated multi-modal reasoning and external knowledge integration, present significant challenges for existing large vision-language models (LVLMs) often limited by their reliance on high-level global features. To address this, we propose MV-CoRe (Multimodal Visual-Conceptual Reasoning), a novel model designed to enhance Complex VQA performance through the deep fusion of diverse visual and linguistic information. MV-CoRe meticulously integrates global embeddings from pre-trained Vision Large Models (VLMs) and Language Large Models (LLMs) with fine-grained semantic-aware visual features, including object detection characteristics and scene graph representations. An innovative Multimodal Fusion Transformer then processes and deeply integrates these diverse feature sets, enabling rich cross-modal attention and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
