OBI-Bench: Can LMMs Aid in Study of Ancient Script on Oracle Bones?
Zijian Chen, Tingzhu Chen, Wenjun Zhang, Guangtao Zhai

TL;DR
OBI-Bench is a comprehensive benchmark designed to evaluate large multi-modal models on complex oracle bone inscription tasks, highlighting current challenges and potential in ancient script research.
Contribution
The paper introduces OBI-Bench, a new benchmark with diverse data and tasks, to systematically assess LMMs' abilities in ancient oracle bone script analysis.
Findings
Current LMMs struggle with fine-grained perception tasks.
Some models perform comparably to untrained humans in deciphering.
Even top models are far from expert-level performance.
Abstract
We introduce OBI-Bench, a holistic benchmark crafted to systematically evaluate large multi-modal models (LMMs) on whole-process oracle bone inscriptions (OBI) processing tasks demanding expert-level domain knowledge and deliberate cognition. OBI-Bench includes 5,523 meticulously collected diverse-sourced images, covering five key domain problems: recognition, rejoining, classification, retrieval, and deciphering. These images span centuries of archaeological findings and years of research by front-line scholars, comprising multi-stage font appearances from excavation to synthesis, such as original oracle bone, inked rubbings, oracle bone fragments, cropped single characters, and handprinted characters. Unlike existing benchmarks, OBI-Bench focuses on advanced visual perception and reasoning with OBI-specific knowledge, challenging LMMs to perform tasks akin to those faced by experts.…
Peer Reviews
Decision: ICLR 2025 Poster
1. Novel (and important) application domain: an important systematic evaluation of LMMs for ancient script analysis, addressing a significant real-world problem in historical research, with potential to accelerate archaeological research and cultural heritage preservation. 2. Comprehensive fine-grained benchmark, covering tasks like recognition, rejoining, classification, retrieval, and deciphering. 3. Task-specific data curation helps to answer the queries particular to each task.
1. The overall technical importance of the benchmark might be limited: the evaluation is very sensitive to the query-answer form (spectrum of questions), and even a small change or some added context can influence the outcomes drastically. Although the work is quite extensive, a decent foray into how prompt engineering affects performance would have been a nice addition. E.g., take the example of the Deciphering task, use the best open-source and proprietary models (as already done i
1. This paper covers five specific areas of oracle bone script exploration, providing a comprehensive summary of prior work in the field. 2. The evaluation dimensions of this paper are diverse; it attempts to explore more fine-grained capabilities through the design of questions such as "How" and "Where" questions. Although the design of these two types of questions may not be critical, this approach seems to offer a potential framework for exploring process supervision mechanisms for models in
1. There are some long-tail issues in the data volume of each task, particularly with an excessive amount for Recognition and insufficient data for Deciphering. 2. Despite the diversity of tasks, would it be possible to provide an overall score to evaluate the comprehensive ability of LMMs? This score should not be a simple average but should also take into account that LMMs are still in the early stages in the field of oracle bone scripts. Consideration should be given to the lower scores that
1. The five domain problems span from excavation to synthesis and from coarse-grained to fine-grained, which is novel for the evaluation of an LMM. 2. The baselines cover most proprietary and open-source LMMs, and the results are consistent with other LMM benchmarks.
1. Some domain problem settings, namely rejoining and deciphering, are not convincing. 2. Results for fine-tuned open-source LLMs are lacking, which is quite important for a domain-specific benchmark. 3. In the analysis of each domain problem or scenario, the essential reason why open LMMs perform as they do is missing, which is important to the community beyond a specific benchmark.
Taxonomy
Topics: Image Processing and 3D Reconstruction · Handwritten Text Recognition Techniques
