Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models
Qinwu Xu, Xin Liu, Yifan Jiang, Haoyu Ren

TL;DR
This paper introduces an OCR-aware training framework for multilingual multimodal models that enhances text recognition, translation accuracy, and robustness in challenging visual conditions using synthetic data, fine-tuning, and structured reasoning prompts.
Contribution
The paper proposes an OCR-aware training approach that combines synthetic OCR-to-translation data, supervised fine-tuning with LoRA adaptation, and structured visual chain-of-thought prompting to improve multilingual OCR and reasoning in multimodal models.
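For readers unfamiliar with LoRA, the standard low-rank adaptation update is sketched below. This is the general formulation of the technique, not a detail taken from this paper: the summary does not say which layers are adapted, or what rank $r$ or scaling $\alpha$ the authors use, so those symbols are placeholders.

```latex
% Standard LoRA update: the pretrained weight W_0 stays frozen,
% and only the low-rank factors A and B are trained.
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Because only $A$ and $B$ receive gradients, the number of trainable parameters per adapted matrix drops from $dk$ to $r(d + k)$.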
Findings
Significant improvements in OCR completeness and translation accuracy.
Enhanced robustness in degraded visual conditions like blur and occlusion.
Better visual-text grounding compared to baseline models.
Abstract
Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate…
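To make the third component more concrete, here is a minimal sketch of how a structured visual chain-of-thought prompt could be assembled. The step names, their wording, and the `build_prompt` helper are all hypothetical: the abstract does not give the authors' actual prompt, so this only illustrates the general idea of forcing the model through explicit locate, transcribe, translate, and verify stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoTStep:
    """One named stage of the reasoning prompt (hypothetical structure)."""
    name: str
    instruction: str


# Assumed steps for illustration only; not taken from the paper.
DEFAULT_STEPS = (
    CoTStep("locate", "List every text region you can see, top to bottom."),
    CoTStep("transcribe", "Transcribe each region exactly; mark unreadable characters with [?]."),
    CoTStep("identify_language", "State the language of each region."),
    CoTStep("translate", "Translate each region into {target_language}."),
    CoTStep("verify", "Flag regions where blur or occlusion makes the reading uncertain."),
)


def build_prompt(target_language: str, steps=DEFAULT_STEPS) -> str:
    """Render the steps as a numbered instruction list for the model."""
    lines = ["You are reading text in an image. Work through these steps in order:"]
    for i, step in enumerate(steps, start=1):
        # Fill in the target language where a step's instruction asks for it.
        text = step.instruction.format(target_language=target_language)
        lines.append(f"{i}. [{step.name}] {text}")
    lines.append("Answer with one numbered section per step.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("English"))
```

Keeping the steps as data rather than a single hard-coded string makes it easy to drop or reorder stages when comparing prompt variants.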
