MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering

Jingwei Peng; Jiehao Chen; Mateo Alejandro Rojas; Meilin Zhang

arXiv:2508.07023 · cs.CV · August 12, 2025

Open Access

TL;DR

MV-CoRe is a novel multimodal reasoning model that integrates diverse visual and linguistic features through a transformer to improve complex visual question answering performance.

Contribution

It introduces a deep fusion approach combining global embeddings, semantic-aware visual features, and scene graphs with a multimodal transformer for enhanced reasoning.

Findings

01. Achieves 77.5% accuracy on the GQA benchmark.

02. Outperforms existing LVLM baselines.

03. Ablation confirms the importance of object and scene graph features.

Abstract

Complex Visual Question Answering (Complex VQA) tasks, which demand sophisticated multi-modal reasoning and external knowledge integration, present significant challenges for existing large vision-language models (LVLMs) often limited by their reliance on high-level global features. To address this, we propose MV-CoRe (Multimodal Visual-Conceptual Reasoning), a novel model designed to enhance Complex VQA performance through the deep fusion of diverse visual and linguistic information. MV-CoRe meticulously integrates global embeddings from pre-trained Vision Large Models (VLMs) and Language Large Models (LLMs) with fine-grained semantic-aware visual features, including object detection characteristics and scene graph representations. An innovative Multimodal Fusion Transformer then processes and deeply integrates these diverse feature sets, enabling rich cross-modal attention and…
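
The fusion idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes every feature type (global visual embedding, question tokens, object features, scene graph features) has already been projected to a shared dimension, and it uses a single self-attention pass with random weights in place of the learned Multimodal Fusion Transformer. The function name `fuse_tokens`, the token counts, and the dimension `d = 16` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_tokens(global_vis, question, objects, scene_graph, rng):
    """One single-head self-attention pass over all feature tokens.

    Hypothetical stand-in for a fusion transformer layer: random
    projection matrices replace learned weights.
    """
    # Concatenate every modality into one token sequence of shape (n, d),
    # so each token can attend to tokens from every other feature set.
    tokens = np.concatenate([global_vis, question, objects, scene_graph], axis=0)
    d = tokens.shape[1]
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(d))   # (n, n) attention weights
    fused = tokens + attn @ v              # residual connection
    return fused, attn

rng = np.random.default_rng(0)
d = 16                                      # assumed shared feature dimension
global_vis = rng.standard_normal((1, d))    # one global image embedding
question = rng.standard_normal((5, d))      # five question tokens
objects = rng.standard_normal((4, d))       # four detected-object features
scene_graph = rng.standard_normal((3, d))   # three scene graph features

fused, attn = fuse_tokens(global_vis, question, objects, scene_graph, rng)
pooled = fused.mean(axis=0)                 # pooled vector for an answer head
print(fused.shape, attn.shape, pooled.shape)
```

Because all tokens share one sequence, each row of `attn` mixes information across modalities, which is the simplest form of the cross-modal attention the abstract refers to. A real model would stack several such layers with multiple heads and trained weights.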

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning