All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark

Davide Testa; Giovanni Bonetta; Raffaella Bernardi; Alessandro Bondielli; Alessandro Lenci; Alessio Miaschi; Lucia Passaro; Bernardo Magnini

arXiv:2502.16989 · cs.CL · September 23, 2025

Open Access

TL;DR

The paper presents MAIA, a native-Italian benchmark for evaluating the multimodal reasoning of vision-language models on videos, with a focus on fine-grained reasoning categories and on the language and culture of the videos.

Contribution

It introduces MAIA, a novel Italian video benchmark with twelve reasoning categories and two aligned tasks (visual statement verification and open-ended visual question answering), to assess and analyze the reasoning capabilities of vision-language models.

Findings

01

Models score low on MAIA's aggregated metric, indicating that their reasoning abilities are fragile.

02

MAIA's reasoning categories are designed to disentangle language and vision relations by highlighting the role of the visual input.

03

The benchmark highlights the need for improved multimodal reasoning models.

Abstract

We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully thought-out design, it evaluates VLMs' consistency and visually grounded natural language comprehension and generation simultaneously through an aggregated metric, revealing low results that highlight…

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Multi-Agent Systems and Negotiation

Methods: Sparse Evolutionary Training