OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research

Caoshuo Li; Zengmao Ding; Xiaobin Hu; Bang Li; Donghao Luo; Xu Peng; Taisong Jin; Yongge Liu; Shengwei Han; Jing Yang; Xiaoping He; Feng Gao; AndyPian Wu; SevenShu; Chaoyang Wang; Chengjie Wang

arXiv:2510.26114·cs.CV·October 31, 2025

4 Reviews

TL;DR

OracleAgent is a multimodal reasoning system that integrates large language models and a comprehensive OBS knowledge base to improve information retrieval and interpretation in Oracle Bone Script research, significantly aiding experts and advancing the field.

Contribution

This paper introduces OracleAgent, the first structured agent system for OBS that combines multimodal tools, a large annotated knowledge base, and LLMs to enhance research efficiency and accuracy.

Findings

1. OracleAgent outperforms existing multimodal models such as GPT-4o in reasoning tasks.
2. It significantly reduces expert research time in OBS interpretation.
3. The system demonstrates effective multimodal retrieval and generation capabilities.

Abstract

As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent seamlessly integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 4

Strengths

- This paper introduces a large, curated domain KB (1.4M crops, 80K texts, 3K docs, a dictionary, plus 15K pixel-level links) that, if released, could materially help the field.
- It offers a thoughtful, end-to-end framing of practical OBS workflows with the right primitives (modalities, retrieval targets, facsimiles). The Fig. 2 architecture and Fig. 3 interaction flow make it concrete.
- The case study with precision/recall/coverage is stronger than a purely qualitative vignette.
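For readers unfamiliar with the metrics named in the last point, here is a minimal sketch of how set-based precision and recall are computed for a single retrieval query. The paper's exact definition of "coverage" is not given in this excerpt; the sketch assumes it means the fraction of queries for which at least one relevant item was retrieved. The function names and toy data are illustrative only.

```python
def retrieval_metrics(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def coverage(results):
    """Fraction of queries with at least one relevant hit.

    ASSUMPTION: this definition is ours, not taken from the paper.
    `results` is a list of (retrieved, relevant) pairs.
    """
    if not results:
        return 0.0
    covered = sum(1 for retrieved, relevant in results
                  if set(retrieved) & set(relevant))
    return covered / len(results)


# Toy data: two queries with hypothetical item ids.
queries = [
    (["a", "b", "c", "d"], ["a", "b", "e"]),  # 2 of 4 retrieved are relevant
    (["x", "y"], ["z"]),                      # no relevant item retrieved
]
```

Reporting all three numbers per query set, rather than a single worked example, is what makes a case study comparable across systems.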

Weaknesses

- Unfair comparison: Most baselines are general-purpose MLLMs without access to the same domain KB, so OracleAgent’s advantage could stem primarily from the private data and tools rather than the agent design. The paper doesn’t ablate “with/without KB/tools” against the same LLM. (No ablation reported.)
- “OracleAgent-Bench” extends OBI-Bench, but details on train/test separation relative to the authors’ massive KB are thin; the risk of overlap or leakage isn’t audited. (The paper lists sources and sizes, not leakage checks.)
- The “
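The leakage concern in this review can be made concrete with a minimal sketch of an exact-duplicate audit between benchmark items and knowledge-base items. This is not the authors' procedure. It assumes items are available as raw bytes keyed by hypothetical ids, and it catches only byte-identical copies; near-duplicates (re-crops, rescans) would need perceptual hashing or embedding similarity.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content hash used to compare items across collections."""
    return hashlib.sha256(data).hexdigest()


def find_overlap(benchmark_items, kb_items):
    """Return sorted benchmark ids whose bytes also appear in the KB.

    Both arguments map a hypothetical item id to raw bytes.
    Only exact (byte-identical) duplicates are detected.
    """
    kb_hashes = {fingerprint(data) for data in kb_items.values()}
    return sorted(item_id for item_id, data in benchmark_items.items()
                  if fingerprint(data) in kb_hashes)


# Toy data standing in for image crops.
kb = {"kb1": b"glyph-A", "kb2": b"glyph-B"}
bench = {"q1": b"glyph-A", "q2": b"glyph-C"}
leaked = find_overlap(bench, kb)
```

An empty result from such a check does not prove the absence of leakage, but a non-empty one is enough to show that the benchmark and KB are not cleanly separated.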

Reviewer 02·Rating 2·Confidence 3

Strengths

1. OBS research seems to be a challenging task in which researchers navigate multiple large, heterogeneous sources using domain-specific tools to perform the required analysis. The proposed agentic framework automates this process and shows high accuracy on OBS tasks, significantly expediting research work.
2. The agentic system is well designed, with a focus on knowledge base curation, followed by tool specification and orchestration of these into an agentic system where an LLM makes and executes a p

Weaknesses

1. Limited novelty and insights for the ML community: the agentic task is highly specialized to OBS research and relatively simple, since planning needs to consider only 7 tools with clear purposes, in contrast to desktop/code agentic frameworks, where complex reasoning and planning over a large number of tools is required. The saturated benchmark scores also indicate this.
2. The evaluation pipeline used for MLLMs is unclear. OracleAgent is evaluated with full access to a domain KB and

Reviewer 03·Rating 6·Confidence 3

Strengths

- Novel System Integration: First multimodal agent framework unifying LLM reasoning with domain-specific tools for ancient script decipherment.
- Rich Multimodal Knowledge Base: Large, expertly annotated dataset linking images and texts across multiple sources.
- Comprehensive Evaluation: Benchmarks across five major tasks with quantitative superiority over GPT-4o and other MLLMs.
- High Practical Value: Demonstrated ability to replicate and accelerate expert workflows; clear cultural and scientific

Weaknesses

- Limited Algorithmic Novelty: Most modules rely on pre-existing models; the innovation lies mainly in orchestration rather than method.
- Reproducibility Details: Some implementation details (training hyperparameters, LLM fine-tuning procedures, tool-calling logic) are missing.
- Domain Generalization: It remains unclear how easily OracleAgent can be transferred beyond OBS to other semiotic or multimodal domains.
- Ablation Studies: The paper would benefit from an analysis isolating the contribution

Reviewer 04·Rating 6·Confidence 2

Strengths

1. OracleAgent combines LLMs with domain-specific tools (e.g., YOLO-based detection and CycleGAN for denoising) to handle complex, multi-step workflows. The agent’s dynamic planning capability allows flexible tool orchestration based on user queries, simulating expert reasoning.
2. The construction of a large-scale, multimodal OBS knowledge base is a significant contribution. It supports fine-grained retrieval across rubbings, facsimiles, and texts, addressing critical bottlenecks in resource ac

Weaknesses

1. The LLM-driven task planning process is opaque. For example, the paper does not explain how OracleAgent prioritizes tools (e.g., why it chooses denoising before retrieval for a cropped character) or how experts can verify/debug its decisions.
2. While experiments cover standard tasks, there is no analysis of OracleAgent’s performance on severely damaged rubbings or undeciphered characters.
3. While case studies simulate expert workflows, there is no feedback from actual OBS scholars (e.g., u
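The first weakness asks how experts could verify or debug tool-ordering decisions. The sketch below shows one way a tool-calling loop can record an inspectable trace of every step. It is a toy illustration, not OracleAgent's implementation: the three tools are string-returning stubs, and `plan` is a hypothetical rule-based stand-in for the LLM planner.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical stub tools; real ones would operate on images.
def denoise(x: str) -> str:
    return f"denoised({x})"


def detect(x: str) -> str:
    return f"detected({x})"


def retrieve(x: str) -> str:
    return f"retrieved({x})"


TOOLS: Dict[str, Callable[[str], str]] = {
    "denoise": denoise,
    "detect": detect,
    "retrieve": retrieve,
}


def plan(query: str) -> List[str]:
    """Rule-based stand-in for an LLM planner (assumption, not the paper's)."""
    steps = []
    if "noisy" in query:
        steps.append("denoise")
    if "rubbing" in query:
        steps.append("detect")
    steps.append("retrieve")
    return steps


def run(query: str, payload: str) -> Tuple[str, List[Tuple[str, str, str]]]:
    """Execute the plan and log (tool, input, output) for each step."""
    trace = []
    current = payload
    for name in plan(query):
        output = TOOLS[name](current)
        trace.append((name, current, output))
        current = output
    return current, trace


result, trace = run("noisy rubbing", "img")
```

Exposing a trace like this lets a domain expert see which tool ran, in what order, and on what input, which is the kind of transparency the reviewer finds missing.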
