INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance

Chenwei Lin; Hanjia Lyu; Xian Xu; Jiebo Luo

arXiv:2406.09105·cs.CV·August 11, 2025

TL;DR

This paper introduces INS-MMBench, a comprehensive benchmark for evaluating the performance of large vision-language models in the insurance domain, addressing a significant gap in specialized multimodal assessment tools.

Contribution

It presents the first hierarchical insurance-specific benchmark with 22 tasks, evaluates 11 LVLMs, and provides insights to advance LVLM application in insurance.
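As a rough illustration of how scores on a hierarchical multiple-choice benchmark might be rolled up, the sketch below groups per-question correctness by fundamental task, meta-task, and insurance type. It is not the authors' evaluation code: the `Item` fields, the exact-match comparison on option letters, and the placeholder task names (`task_a`, `task_b`, `task_c`, `example meta-task`) are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One multiple-choice question with a model prediction (hypothetical schema)."""
    insurance_type: str  # e.g. "auto", "property", "health", "agricultural"
    meta_task: str       # mid level of the hierarchy
    task: str            # fundamental task (placeholder names below)
    answer: str          # ground-truth option letter
    prediction: str      # option letter returned by the model


def accuracy_by_level(items):
    """Return accuracy per group at each of the three hierarchy levels."""
    levels = ("insurance_type", "meta_task", "task")
    groups = {level: defaultdict(list) for level in levels}
    for item in items:
        # Assumption: a prediction counts as correct on a case-insensitive
        # exact match of the option letter.
        correct = item.prediction.strip().upper() == item.answer.strip().upper()
        for level in levels:
            groups[level][getattr(item, level)].append(correct)
    return {
        level: {name: sum(flags) / len(flags) for name, flags in by_name.items()}
        for level, by_name in groups.items()
    }


# Toy data; the values are invented for illustration only.
items = [
    Item("auto", "vehicle damage detection", "task_a", "A", "A"),
    Item("auto", "vehicle damage detection", "task_a", "B", "C"),
    Item("auto", "vehicle information extraction", "task_b", "D", "d"),
    Item("health", "example meta-task", "task_c", "A", "B"),
]
scores = accuracy_by_level(items)
```

Reporting all three levels side by side makes it easier to see whether a weak insurance-type score comes from one fundamental task or from the whole group.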

Findings

01

INS-MMBench effectively assesses LVLMs in insurance tasks.

02

Current LVLMs show strengths and limitations in insurance scenarios.

03

Benchmark results guide future model development for insurance applications.

Abstract

Large Vision-Language Models (LVLMs) and Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance in various general multimodal applications and have shown increasing promise in specialized domains. However, their potential in the insurance domain, characterized by diverse application scenarios and rich multimodal data, remains largely underexplored. To date, there is no systematic review of multimodal tasks, nor a benchmark specifically designed to assess the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance industry. This study systematically reviews and categorizes multimodal tasks for 4 representative types of insurance: auto, property, health, and agricultural. We introduce INS-MMBench, the first hierarchical benchmark tailored for the insurance domain. INS-MMBench encompasses 22 fundamental tasks, 12…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 5·Confidence 5

Strengths

1. Comprehensive Benchmark: The paper presents INS-MMBench, which is the first comprehensive benchmark tailored for evaluating LVLMs in the insurance domain. This benchmark is extensive, covering 8,856 multiple-choice visual questions across 12 meta-tasks and 22 fundamental tasks, providing a robust framework for assessing LVLM capabilities in various insurance scenarios.

2. Systematic Framework: The authors have developed a systematic and hierarchical task definition that ensures the tasks are…

Weaknesses

1. Multi-Choice Format Limitations: This benchmark follows a similar style to MMBench and MME in the general multimodal domain, all of which formulate their questions into multiple-choice formats. While this is an effective method for evaluating model performance, it has limitations that prevent generalization to open-ended question answering, which is more representative of real-world applications.

2. Static Benchmark and Data Leakage: The benchmark is static, which does not mitigate the data…

Reviewer 02·Rating 3·Confidence 4

Strengths

1. The motivation behind establishing an insurance benchmark is worthwhile. Evaluating LVLMs' capabilities on core insurance stages like underwriting and claims processing is practical and meaningful.

2. The benchmark covers a reasonable range of core insurance types relevant to key areas in everyday insurance applications.

3. The study provides an insightful error analysis, highlighting the current limitations of LVLMs in interpreting insurance-specific visual content.

Weaknesses

1. **Misalignment between Intent and Implementation**: While the authors claim the benchmark includes 12 meta-tasks and 22 fundamental tasks across stages like underwriting and claims processing in the Introduction section, the tasks illustrated in the paper are only loosely related to these stages. For example, meta-tasks in auto insurance such as “vehicle information extraction” and “vehicle damage detection” focus heavily on general computer vision tasks rather than directly addressing insura

Reviewer 03·Rating 6·Confidence 3

Strengths

(1) Originality: INS-MMBench is the first benchmark tailored to evaluate LVLMs in the insurance domain. The authors' approach to defining tasks using a bottom-up hierarchical methodology is innovative and ensures that the benchmark aligns with real-world insurance scenarios, making it a pioneering effort in applying LVLMs to this new domain.

(2) Quality: The authors systematically identify and organize multimodal tasks across four types of insurance, and their comprehensive evaluation of ten LV…

Weaknesses

1. Benchmark Definition Lacks Depth in Insurance Scenarios: While INS-MMBench introduces tasks related to insurance, many are more aligned with general, common-sense VQA than with the specialized, nuanced scenarios seen in real-world insurance applications. To better reflect practical needs, the benchmark should include more complex tasks, such as multi-step reasoning or risk assessment based on a mix of visual and contextual data.

2. Overemphasis on Basic Tasks: Some tasks, like license plate re…

Code & Models

Repositories

fdu-ins/ins-mmbench
pytorch·Official

Taxonomy

Topics·Insurance and Financial Risk Management