Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

Qinwu Xu; Xin Liu; Yifan Jiang; Haoyu Ren

arXiv:2605.16409 · cs.CV · May 19, 2026

TL;DR

This paper introduces an OCR-aware training framework for multilingual multimodal models that enhances text recognition, translation accuracy, and robustness in challenging visual conditions using synthetic data, fine-tuning, and structured reasoning prompts.

Contribution

The paper proposes an OCR-aware training approach that combines synthetic data generation, supervised fine-tuning with LoRA, and visual chain-of-thought prompting to improve multilingual OCR and reasoning in multimodal models.
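
For readers unfamiliar with LoRA, the standard low-rank update from the original LoRA formulation is reproduced below as background. The summary here does not state the rank, the scaling factor, or which layers the authors adapt, so the symbols $r$, $\alpha$, $d$, and $k$ are generic placeholders rather than the paper's settings.

```latex
% Standard LoRA update (background only; the paper's rank r, scale \alpha,
% and choice of adapted layers are not given in this summary).
\[
  h = W_0 x + \frac{\alpha}{r}\, B A x,
  \qquad
  W_0 \in \mathbb{R}^{d \times k},\;
  B \in \mathbb{R}^{d \times r},\;
  A \in \mathbb{R}^{r \times k},\;
  r \ll \min(d, k)
\]
% W_0 is the frozen pretrained weight; only A and B are trained.
```

Because only $A$ and $B$ receive gradients, the number of trainable parameters per adapted matrix drops from $dk$ to $r(d + k)$.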

Findings

01. Significant improvements in OCR completeness and translation accuracy.

02. Enhanced robustness under degraded visual conditions such as blur and occlusion.

03. Better visual-text grounding than baseline models.

Abstract

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate…
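
To make component (iii) more concrete, the sketch below shows one way a structured visual chain-of-thought prompt could be assembled. It is only an illustration: the abstract does not give the authors' prompt template, so the function name `build_visual_cot_prompt`, the step wording in `DEFAULT_STEPS`, and the `[?]` marker for unreadable characters are all hypothetical.

```python
# Hypothetical step wording. The paper's actual prompt template is not
# given in the abstract, so these steps are assumptions for illustration.
DEFAULT_STEPS = (
    "List every text region visible in the image, including small or partly occluded ones.",
    "Transcribe each region in its original script; mark unreadable characters with [?].",
    "Note any region where blur or occlusion makes the transcription uncertain.",
    "Translate each transcription into {target_lang}, keeping the region order.",
)


def build_visual_cot_prompt(target_lang: str, steps=DEFAULT_STEPS) -> str:
    """Assemble a numbered, step-by-step OCR-then-translate prompt."""
    if not target_lang.strip():
        raise ValueError("target_lang must be non-empty")
    lines = ["Work through the following steps in order before giving a final answer."]
    for i, step in enumerate(steps, start=1):
        # Only the translation step contains the {target_lang} placeholder;
        # str.format leaves the other steps unchanged.
        lines.append(f"Step {i}: {step.format(target_lang=target_lang)}")
    lines.append("Final answer: the translated text only.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_visual_cot_prompt("English"))
```

Separating transcription from translation, and asking the model to flag uncertain regions before translating, is one plausible reading of "reasoning under uncertain visual conditions"; the authors' prompts may be organized differently.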
