An open dataset for oracle bone script recognition and decipherment
Pengjie Wang, Kaile Zhang, Xinyu Wang, Shengwei Han, Yongge Liu, Jinpeng Wan, Haisu Guan, Zhebin Kuang, Lianwen Jin, Xiang Bai, Yuliang Liu

TL;DR
This paper introduces the HUST-OBC dataset, a large collection of oracle bone script images, to facilitate AI-based decipherment of ancient Chinese characters, addressing the lack of high-quality datasets in this field.
Contribution
The paper presents a comprehensive, publicly available dataset of oracle bone characters, enabling future AI research for deciphering ancient scripts.
Findings
Dataset includes 140,053 images of oracle bone characters.
Contains images of both deciphered and undeciphered characters.
Provides a resource to advance AI-assisted decipherment methods.
Abstract
Oracle bone script, one of the earliest known forms of ancient Chinese writing, presents invaluable research materials for scholars studying the humanities and geography of the Shang Dynasty, dating back 3,000 years. The immense historical and cultural significance of these writings cannot be overstated. However, the passage of time has obscured much of their meaning, presenting a significant challenge in deciphering these ancient texts. With the advent of Artificial Intelligence (AI), employing AI to assist in deciphering Oracle Bone Characters (OBCs) has become a feasible option. Yet, progress in this area has been hindered by a lack of high-quality datasets. To address this issue, this paper details the creation of the HUST-OBC dataset. This dataset encompasses 77,064 images of 1,588 individual deciphered characters and 62,989 images of 9,411 undeciphered characters, with a total of…
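As a quick sanity check, the counts quoted in the abstract can be added up and compared with the 140,053-image total listed under Findings. The sketch below is illustrative only: the constant and function names are hypothetical and not part of the dataset's tooling, and the per-class averages are derived here from the quoted figures rather than reported by the authors.

```python
# Figures quoted in the abstract; all names below are hypothetical.
DECIPHERED_IMAGES = 77_064
DECIPHERED_CLASSES = 1_588
UNDECIPHERED_IMAGES = 62_989
UNDECIPHERED_CLASSES = 9_411


def summarize_split() -> dict:
    """Derive totals and average images per class from the quoted counts."""
    return {
        "total_images": DECIPHERED_IMAGES + UNDECIPHERED_IMAGES,
        "total_classes": DECIPHERED_CLASSES + UNDECIPHERED_CLASSES,
        "avg_images_per_deciphered_class": DECIPHERED_IMAGES / DECIPHERED_CLASSES,
        "avg_images_per_undeciphered_class": UNDECIPHERED_IMAGES / UNDECIPHERED_CLASSES,
    }


summary = summarize_split()
for name, value in summary.items():
    print(f"{name}: {value:,.2f}" if isinstance(value, float) else f"{name}: {value:,}")
```

The two image counts sum to 140,053, matching the Findings figure. The derived averages also suggest that undeciphered characters have far fewer examples per class than deciphered ones, which is worth keeping in mind when training a recognizer on both subsets.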
Taxonomy
Topics: Natural Language Processing Techniques
