Domain Specific Benchmarks for Evaluating Multimodal Large Language Models

Khizar Anjum; Muhammad Arbab Arshad; Kadhim Hayawi; Efstathios Polyzos; Asadullah Tariq; Mohamed Adel Serhani; Laiba Batool; Brady Lund; Nishith Reddy Mannuru; Ravi Varma Kumar Bevara; Taslim Mahbub; Muhammad Zeeshan Akram; Sakib Shahriar

arXiv:2506.12958 · cs.LG · June 23, 2025 · 2 cites

Open Access

TL;DR

This paper presents a taxonomy of seven key disciplines, reviews domain-specific LLM benchmarks, and categorizes them to facilitate targeted evaluation and advancement of multimodal large language models across various fields.

Contribution

It introduces a domain-specific taxonomy and compiles benchmarks, addressing the gap in domain-focused evaluation of multimodal large language models.

Findings

01. Identified seven key disciplines for LLM application.

02. Provided a comprehensive review of domain-specific benchmarks.

03. Created a categorized resource for future research.

Abstract

Large language models (LLMs) are increasingly being deployed across disciplines due to their advanced reasoning and problem-solving capabilities. To measure their effectiveness, various benchmarks have been developed that assess aspects of LLM reasoning, comprehension, and problem-solving. While several surveys address LLM evaluation and benchmarks, a domain-specific analysis remains underexplored in the literature. This paper introduces a taxonomy of seven key disciplines, encompassing various domains and application areas where LLMs are extensively utilized. Additionally, we provide a comprehensive review of LLM benchmarks and survey papers within each domain, highlighting the unique capabilities of LLMs and the challenges faced in their application. Finally, we compile and categorize these benchmarks by domain to create an accessible resource for researchers, aiming to pave the way…

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification