TL;DR
MegaHan97K is the largest Chinese character dataset to date, covering 97,455 categories. It enables research into mega-category recognition and supports OCR for cultural heritage preservation and digital applications.
Contribution
It provides the first comprehensive dataset supporting GB18030-2022, addresses the long-tail distribution problem, and benchmarks the challenges of mega-category Chinese character recognition.
Findings
Identifies increased storage and recognition challenges in mega-category OCR.
Highlights the difficulty of zero-shot learning for large-scale Chinese character sets.
Provides a foundation for future research in large-scale pattern recognition.
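To give a rough sense of why storage grows with the number of categories, the sketch below counts the parameters of a plain linear classification layer at the three category counts quoted on this page. The feature width of 512, the float32 storage, and the function name are assumptions chosen for illustration; they are not taken from the paper, and the paper's own models may be structured differently.

```python
# Hypothetical illustration: size of a final linear classification layer
# as the number of character categories grows.
FEATURE_DIM = 512      # assumed embedding width, not taken from the paper
BYTES_PER_PARAM = 4    # assumed float32 storage


def classifier_head_params(num_classes: int, feature_dim: int = FEATURE_DIM) -> int:
    """Weights (feature_dim x num_classes) plus one bias per class."""
    return feature_dim * num_classes + num_classes


# Category counts as quoted in the abstract.
category_counts = {
    "largest prior dataset": 16_151,
    "GB18030-2022": 87_887,
    "MegaHan97K": 97_455,
}

for name, num_classes in category_counts.items():
    params = classifier_head_params(num_classes)
    mib = params * BYTES_PER_PARAM / 2**20
    print(f"{name}: {num_classes:,} classes -> {params:,} params (~{mib:.1f} MiB)")
```

Under these assumptions the layer grows linearly with the category count, so moving from 16,151 to 97,455 categories multiplies its size by about six.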
Abstract
Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the…
