MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

Yuyi Zhang; Yongxin Shi; Peirong Zhang; Yixin Zhao; Zhenhua Yang; Lianwen Jin

arXiv:2506.04807 · cs.CV · June 6, 2025

TL;DR

MegaHan97K is a large-scale Chinese character dataset covering 97,455 categories, about six times the 16,151 categories of the largest previous dataset. It enables research on mega-category recognition and supports OCR for cultural heritage preservation and digital applications.

Contribution

MegaHan97K is the first comprehensive dataset to support the GB18030-2022 standard. It also addresses the long-tail distribution of characters and benchmarks the challenges of mega-category Chinese character recognition.

Findings

01. Identifies increased storage and recognition challenges in mega-category OCR.

02. Highlights the difficulty of zero-shot learning for large-scale Chinese character sets.

03. Provides a foundation for future research in large-scale pattern recognition.
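
To give a rough sense of why storage grows with category count (finding 01), the sketch below counts the parameters of a single fully connected classification layer at the three category counts quoted in the abstract. It is an illustration only, not the paper's model: the feature dimension `FEATURE_DIM = 512` and the helper names `linear_head_params` and `linear_head_mib` are assumptions introduced here.

```python
# Category counts quoted in the abstract.
CATEGORY_COUNTS = {
    "largest prior dataset": 16_151,
    "GB18030-2022": 87_887,
    "MegaHan97K": 97_455,
}

FEATURE_DIM = 512      # assumption: hypothetical feature size, not from the paper
BYTES_PER_PARAM = 4    # float32 storage


def linear_head_params(num_classes: int, feature_dim: int = FEATURE_DIM) -> int:
    """Weights plus biases of one fully connected classification layer."""
    return num_classes * feature_dim + num_classes


def linear_head_mib(num_classes: int, feature_dim: int = FEATURE_DIM) -> float:
    """Storage of that layer in MiB when kept as float32."""
    return linear_head_params(num_classes, feature_dim) * BYTES_PER_PARAM / 2**20


if __name__ == "__main__":
    for name, n in CATEGORY_COUNTS.items():
        print(f"{name:>22}: {linear_head_params(n):>12,} params, "
              f"{linear_head_mib(n):7.1f} MiB")
```

Under these assumptions the output layer alone scales linearly with the number of categories, so a head for 97,455 classes is about six times the size of one for 16,151 classes before any backbone parameters are counted.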

Abstract

Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the…

Code & Models

Repositories

scut-dlvclab/megahan97k (PyTorch, official)
