Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models

YiFan Zhang; Shanglin Lei; Runqi Qiao; Zhuoma GongQue; Xiaoshuai Song; Guanting Dong; Qiuna Tan; Zhe Wei; Peiqing Yang; Ye Tian; Yadong Xue; Xiaofei Wang; Honggang Zhang

arXiv:2412.12606 · cs.AI · December 18, 2024

Open Access

TL;DR

The paper introduces the Multi-Dimensional Insights (MDI) benchmark to evaluate large multimodal models across diverse real-world scenarios, emphasizing understanding, reasoning, and personalization for different age groups.

Contribution

It presents a comprehensive, multi-faceted benchmark with over 500 images, stratified questions, and age-specific assessments to better evaluate LMMs' real-world alignment and personalization capabilities.

Findings

01. GPT-4o achieves 79% accuracy on age-related tasks.

02. Existing LMMs still have significant room for improvement.

03. The benchmark reveals gaps in models' ability to meet diverse human needs.

Abstract

The rapidly developing field of large multimodal models (LMMs) has led to the emergence of diverse models with remarkable capabilities. However, existing benchmarks fail to comprehensively, objectively and accurately evaluate whether LMMs align with the diverse needs of humans in real-world scenarios. To bridge this gap, we propose the Multi-Dimensional Insights (MDI) benchmark, which includes over 500 images covering six common scenarios of human life. Notably, the MDI-Benchmark offers two significant advantages over existing evaluations: (1) Each image is accompanied by two types of questions: simple questions to assess the model's understanding of the image, and complex questions to evaluate the model's ability to analyze and reason beyond basic content. (2) Recognizing that people of different age groups have varying needs and perspectives when faced with the same scenario, our…
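To make the two-tier, age-aware design concrete, the following is a minimal sketch of how per-stratum accuracy could be computed. It is not the authors' evaluation code. The `Item` fields, the `stratified_accuracy` helper, the scenario and age-group labels, and the multiple-choice answer format are all assumptions made for illustration; only the "simple" and "complex" question types come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One benchmark question. All field names are hypothetical."""
    image_id: str
    scenario: str   # placeholder for one of the six life scenarios
    level: str      # "simple" or "complex", as described in the abstract
    age_group: str  # placeholder label; the real age groups are not given here
    answer: str     # gold answer, assumed to be a multiple-choice letter


def stratified_accuracy(items, predictions):
    """Return accuracy keyed by (level, age_group)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        key = (item.level, item.age_group)
        total[key] += 1
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if pred.strip().upper() == item.answer.strip().upper():
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Toy data: each image carries one simple and one complex question.
items = [
    Item("img1", "scenario_1", "simple", "group_a", "A"),
    Item("img1", "scenario_1", "complex", "group_a", "C"),
    Item("img2", "scenario_2", "simple", "group_b", "B"),
    Item("img2", "scenario_2", "complex", "group_b", "D"),
]
predictions = ["A", "B", "b", "D"]  # hypothetical model outputs
scores = stratified_accuracy(items, predictions)
```

Keying the result by both question level and age group keeps the two dimensions separable, so a model that handles simple questions well but fails complex ones for a particular group shows up directly in the scores.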

Taxonomy

Topics: Speech and dialogue systems · Multimedia Communication and Technology

Methods: ALIGN