MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models

Seif Ahmed; Mohamed T. Younes; Abdelrahman Moustafa; Abdelrahman Allam; Hamza Moustafa

arXiv:2507.11114 · cs.CL · July 16, 2025

TL;DR

This paper introduces a robust ensemble system for multilingual multimodal reasoning that combines vision-language models and prompt engineering, achieving top accuracy in the ImageCLEF 2025 challenge across multiple languages.

Contribution

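The paper presents an ensemble that assigns Gemini 2.5 Flash to visual description, Gemini 1.5 Pro to caption refinement and consistency checks, and Gemini 2.5 Pro to final answer selection, coordinated through engineered few-shot and zero-shot prompts for multilingual multimodal reasoning.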

Findings

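01. The ensemble system achieved 81.4% accuracy on the leaderboard.

02. Prompt design mattered: enforcing concise, language-normalized answer formats and prohibiting explanatory text raised model accuracy from 55.9% to 61.7%.

03. Gemini 2.5 Flash evaluated zero-shot substantially outperformed the trained models in the ablation study.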
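
The answer-format constraint behind the prompt-design finding can be illustrated with a small sketch. Everything named here is hypothetical and not taken from the paper: the `PROMPT_SUFFIX` wording, the `LABEL_MAP` from Cyrillic option labels to Latin letters, and the `normalize_answer` helper are assumptions chosen only to show what "concise, language-normalized, no explanatory text" might look like in practice.

```python
import unicodedata

# Hypothetical instruction appended to a prompt (wording is an assumption).
PROMPT_SUFFIX = "Answer with a single option letter (A, B, C, D, or E) and nothing else."

# Hypothetical mapping from Cyrillic option labels to Latin letters (illustrative only).
LABEL_MAP = {
    "\u0410": "A",  # Cyrillic А
    "\u0411": "B",  # Cyrillic Б
    "\u0412": "C",  # Cyrillic В
    "\u0413": "D",  # Cyrillic Г
    "\u0414": "E",  # Cyrillic Д
}


def normalize_answer(reply, options="ABCDE"):
    """Return a single Latin option letter, or None if the reply is not concise."""
    text = unicodedata.normalize("NFKC", reply).strip().upper()
    text = text.strip(".():[] ")       # drop simple punctuation around the letter
    text = LABEL_MAP.get(text, text)   # map a non-Latin label to its Latin letter
    if len(text) == 1 and text in options:
        return text
    return None                        # explanatory or malformed replies are rejected
```

Under these assumptions, a reply such as `" b. "` normalizes to `"B"`, while a reply that contains an explanation is rejected rather than parsed heuristically.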

Abstract

We present a robust ensemble-based system for multilingual multimodal reasoning, designed for the ImageCLEF 2025 EXAMS-V challenge. Our approach integrates Gemini 2.5 Flash for visual description, Gemini 1.5 Pro for caption refinement and consistency checks, and Gemini 2.5 Pro as the reasoner that handles final answer selection, all coordinated through carefully engineered few-shot and zero-shot prompts. We conducted an extensive ablation study, training several large language models (Gemini 2.5 Flash, Phi 4, Gemma 3, Mistral) on an English dataset and its multilingually augmented version. Additionally, we evaluated Gemini 2.5 Flash in a zero-shot setting for comparison and found it to substantially outperform the trained models. Prompt design also proved critical: enforcing concise, language-normalized formats and prohibiting explanatory text boosted model accuracy on the English…
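
The three roles named in the abstract (describer, refiner, reasoner) can be pictured as a chain of stages. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `ExamPipeline` class, the stage signatures, and the choice to pass the image to the refiner are all hypothetical, and the stages are plain callables so that no specific model API is implied. The stub functions stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExamPipeline:
    """Hypothetical three-stage chain: describe -> refine -> reason."""
    describe: Callable[[bytes], str]      # e.g. a fast vision model producing a description
    refine: Callable[[bytes, str], str]   # e.g. a second model checking and refining the caption
    reason: Callable[[str], str]          # e.g. a stronger model selecting the final answer

    def answer(self, image: bytes) -> str:
        draft = self.describe(image)
        caption = self.refine(image, draft)
        return self.reason(caption).strip()


# Stub stages used only to show the data flow; real stages would call a model.
def fake_describe(image: bytes) -> str:
    return "A multiple-choice question with a diagram."


def fake_refine(image: bytes, draft: str) -> str:
    return draft + " Options A to D are legible."


def fake_reason(caption: str) -> str:
    return " B\n"


pipeline = ExamPipeline(fake_describe, fake_refine, fake_reason)
print(pipeline.answer(b"fake-image-bytes"))
```

Keeping each stage as an injected callable makes it straightforward to swap models or prompts per stage, which is the kind of comparison an ablation study relies on.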

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications