Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu
Yan Hon Michael Chung, Donghyeok Choi

TL;DR
This paper develops a cost-effective OCR system for the endangered Manchu language by fine-tuning vision-language models on synthetic data, achieving high accuracy on real-world historical documents and enabling digital humanities research.
Contribution
It introduces a novel fine-tuning approach for vision-language models on synthetic data to effectively recognize historical Manchu documents, facilitating low-resource language digitization.
Findings
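LLaMA-3.2-11B achieved 93.1% accuracy on real-world handwritten documents, after reaching 98.3% word accuracy on synthetic data.
Models fine-tuned on synthetic word images transferred effectively to real documents.
Compared with a traditional CRNN baseline, which fell from 99.8% accuracy on synthetic data to 72.5% on real documents, the proposed approach maintains high accuracy with lower resource requirements.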
Abstract
Manchu, a critically endangered language essential for understanding early modern Eastern Eurasian history, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B) on 60,000 synthetic Manchu word images using parameter-efficient training. LLaMA-3.2-11B achieved exceptional performance with 98.3% word accuracy and 0.0024 character error rate on synthetic data, while crucially maintaining 93.1% accuracy on real-world handwritten documents. Comparative evaluation reveals substantial advantages over traditional approaches: while a CRNN baseline achieved 99.8% synthetic accuracy, it suffered severe degradation to 72.5% on real documents. Our approach demonstrates effective synthetic-to-real domain transfer, providing a…
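The abstract reports two metrics, word accuracy and character error rate (CER). The sketch below shows one common way to compute them for word-level OCR output. It is an illustration under assumptions, not the authors' evaluation code: the function names are hypothetical, the sample strings are made-up romanized words, and the CER here is pooled over all words (total edit distance divided by total reference characters), which may differ from the averaging the paper uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of word images whose prediction matches the reference exactly."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally sized, non-empty lists")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def character_error_rate(predictions: list[str], references: list[str]) -> float:
    """Pooled CER: total edit distance over total reference characters (assumed definition)."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally sized, non-empty lists")
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# Hypothetical example data: three reference words, one prediction off by one character.
references = ["manju", "gisun", "bithe"]
predictions = ["manju", "gisun", "bitha"]

acc = word_accuracy(predictions, references)         # 2 of 3 words match exactly
cer = character_error_rate(predictions, references)  # 1 edit over 15 reference characters
print(f"word accuracy = {acc:.3f}, CER = {cer:.3f}")
```

The two metrics can diverge sharply: a single wrong character makes the whole word count as an error for word accuracy, while it adds only one edit to the CER numerator. This is why a system can report a CER well below one percent alongside a word accuracy a few points short of 100%.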
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Digital Humanities and Scholarship
