Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling

Shengwu Xiong; Tianyu Zou; Cong Wang; Xuelong Li

arXiv:2506.21572·cs.CL·November 14, 2025

TL;DR

This paper introduces a new framework and benchmark for evaluating multimodal large language models by aligning tasks with human-like cognitive structures, improving interpretability and diagnostic power.

Contribution

It proposes a structural equation modeling framework and a Piaget-inspired hierarchy to better evaluate MLLM capabilities, leading to the creation of the GOLD benchmark.

Findings

01. The GOLD benchmark is more interpretable than prior benchmarks.

02. The framework reduces indicator redundancy and improves diagnostic clarity.

03. Experiments show closer alignment with human cognitive structures.

Abstract

Evaluating multimodal large language models (MLLMs) is fundamentally challenged by the absence of structured, interpretable, and theoretically grounded benchmarks; current heuristically grouped tasks have vague cognitive targets, overlapping abilities, redundant indicators, and weak diagnostic power. We therefore propose a structural-equation-modeling-aligned framework that quantifies internal validity, dimensional separability, and component contributions, and introduce a Piaget-inspired capability hierarchy that stratifies MLLM abilities into Perception, Memory, and Reasoning. Reorganizing existing tasks under this theory, we build the GOLD benchmark, whose experiments show superior interpretability, lower indicator redundancy, and clearer cognitive consistency than prior benchmarks.
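The abstract does not say which statistics the framework computes, so the following is only a rough illustration of what "internal validity" and "indicator redundancy" checks can look like for one capability group. It assumes Cronbach's alpha as a stand-in for internal consistency and a pairwise Pearson correlation threshold as a stand-in for redundancy; neither is confirmed as the paper's method, and a full structural equation model would need a dedicated library. The task names, scores, and the 0.95 threshold are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def cronbach_alpha(items):
    """Internal consistency of a group of indicators.

    items: list of columns, one per indicator (task); each column holds
    one score per evaluated model, in the same model order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two indicators")
    # Per-model total score across all indicators in the group.
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(pvariance(col) for col in items)
    return k / (k - 1) * (1 - item_var / pvariance(totals))


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(sx * sy)


def redundant_pairs(indicators, threshold=0.95):
    """Indicator pairs whose absolute correlation meets the threshold."""
    names = sorted(indicators)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(indicators[a], indicators[b])) >= threshold
    ]


# Hypothetical scores of five models on three "Perception" tasks.
# "ocr_copy" is a near-duplicate of "ocr" to mimic a redundant indicator.
perception = {
    "ocr": [0.20, 0.40, 0.60, 0.80, 1.00],
    "ocr_copy": [0.21, 0.41, 0.61, 0.81, 1.01],
    "counting": [0.50, 0.30, 0.70, 0.60, 0.90],
}

alpha = cronbach_alpha(list(perception.values()))
flagged = redundant_pairs(perception)
```

Here `alpha` summarizes how consistently the three tasks rank the models, and `flagged` lists task pairs that carry nearly the same information and are candidates for merging or removal. Note that a near-duplicate indicator inflates alpha, which is one reason to check redundancy alongside consistency.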
