OBI-Bench: Can LMMs Aid in Study of Ancient Script on Oracle Bones?

Zijian Chen; Tingzhu Chen; Wenjun Zhang; Guangtao Zhai

arXiv:2412.01175·cs.CV·February 12, 2025·3 cites

TL;DR

OBI-Bench is a comprehensive benchmark designed to evaluate large multi-modal models on complex oracle bone inscription tasks, highlighting current challenges and potential in ancient script research.

Contribution

The paper introduces OBI-Bench, a new benchmark with diverse data and tasks, to systematically assess LMMs' abilities in ancient oracle bone script analysis.

Findings

1. Current LMMs struggle with fine-grained perception tasks.
2. Some models perform comparably to untrained humans in deciphering.
3. Even top models are far from expert-level performance.

Abstract

We introduce OBI-Bench, a holistic benchmark crafted to systematically evaluate large multi-modal models (LMMs) on whole-process oracle bone inscriptions (OBI) processing tasks demanding expert-level domain knowledge and deliberate cognition. OBI-Bench includes 5,523 meticulously collected diverse-sourced images, covering five key domain problems: recognition, rejoining, classification, retrieval, and deciphering. These images span centuries of archaeological findings and years of research by front-line scholars, comprising multi-stage font appearances from excavation to synthesis, such as original oracle bone, inked rubbings, oracle bone fragments, cropped single characters, and handprinted characters. Unlike existing benchmarks, OBI-Bench focuses on advanced visual perception and reasoning with OBI-specific knowledge, challenging LMMs to perform tasks akin to those faced by experts.…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. Novel and important application domain: a systematic evaluation of LMMs for ancient script analysis, addressing a significant real-world problem in historical research, with potential to accelerate archaeological research and cultural heritage preservation.
2. Comprehensive fine-grained benchmark covering recognition, rejoining, classification, retrieval, and deciphering.
3. Task-specific data curation helps answer the questions particular to each task.

Weaknesses

1. The overall technical importance of the benchmark might be limited: the evaluation is very sensitive to the query-answer form (the spectrum of questions), and even a small change or some added context can influence the outcomes drastically. Although the work is quite extensive, an investigation of how prompt engineering affects performance would have been a nice addition. E.g., take the Deciphering task and use the best open-source and proprietary models (as already done i

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. This paper covers five specific areas of oracle bone script exploration, providing a comprehensive summary of prior work in the field.
2. The evaluation dimensions of this paper are diverse; it attempts to explore more fine-grained capabilities through the design of "How" and "Where" questions. Although the design of these two types of questions may not be critical, this approach seems to offer a potential framework for exploring process supervision mechanisms for models in

Weaknesses

1. The data volume is imbalanced across tasks (a long-tail issue), with an excessive amount for Recognition and insufficient data for Deciphering.
2. Despite the diversity of tasks, would it be possible to provide an overall score to evaluate the comprehensive ability of LMMs? This score should not be a simple average but should also take into account that LMMs are still in the early stages in the field of oracle bone scripts. Consideration should be given to the lower scores that

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The five domain problems span from excavation to synthesis and from coarse-grained to fine-grained, which is novel for the evaluation of an LMM.
2. The baselines cover most proprietary and open-source LMMs, and the results are consistent with other LMM benchmarks.

Weaknesses

1. Some domain problem settings, namely rejoining and deciphering, are not convincing.
2. Results for fine-tuned open-source LLMs are missing, which is quite important for a domain-specific benchmark.
3. The analysis of each domain problem and scenario does not explain the underlying reason why open LMMs perform as they do, which matters to the community beyond this specific benchmark.

Videos

OBI-Bench: Can LMMs Aid in Study of Ancient Script on Oracle Bones? · SlidesLive

Taxonomy

Topics: Image Processing and 3D Reconstruction · Handwritten Text Recognition Techniques