Improving MLLM's Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency

Yupu Liang; Yaping Zhang; Zhiyang Zhang; Zhiyuan Chen; Yang Zhao; Lu Xiang; Chengqing Zong; Yu Zhou

arXiv:2507.08309·cs.CL·July 14, 2025

TL;DR

This paper introduces Synchronously Self-Reviewing (SSR), a fine-tuning method that improves the ability of Multimodal Large Language Models (MLLMs) to perform Document Image Machine Translation (DIMT) by leveraging and preserving their OCR skills.

Contribution

The paper proposes SSR, a new fine-tuning paradigm that improves DIMT performance and mitigates catastrophic forgetting of OCR abilities in MLLMs.

Findings

01. SSR improves DIMT translation accuracy.

02. SSR helps preserve OCR proficiency during fine-tuning.

03. SSR enhances generalization on OCR and DIMT tasks.

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on a DIMT dataset often cause the model to forget its existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR) of its OCR proficiency, inspired by the concept of "Bilingual Cognitive Advantage". Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive…
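The "OCR text before translation text" ordering described in the abstract can be illustrated with a minimal sketch of how a fine-tuning target might be laid out. This is not the authors' implementation: the `DimtExample` type, the `<ocr>` and `<translation>` markers, and all function names are hypothetical, and the abstract does not say how the OCR text is obtained or how the output is formatted.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical markers: the abstract does not specify any output template.
OCR_TAG = "<ocr>"
MT_TAG = "<translation>"


@dataclass
class DimtExample:
    """One illustrative training example (all fields are assumptions)."""
    image_path: str   # path to the document image
    ocr_text: str     # source-language text read from the image
    translation: str  # target-language translation of that text


def build_plain_sft_target(ex: DimtExample) -> str:
    """Baseline layout: the target sequence is the translation only."""
    return ex.translation


def build_ssr_style_target(ex: DimtExample) -> str:
    """OCR-first layout: OCR text, then the translation, in one sequence."""
    return f"{OCR_TAG}\n{ex.ocr_text}\n{MT_TAG}\n{ex.translation}"


def split_ssr_style_output(output: str) -> Tuple[str, str]:
    """Recover (ocr_text, translation) from an OCR-first sequence."""
    ocr_part, sep, mt_part = output.partition(MT_TAG)
    if not sep:
        raise ValueError("missing translation marker")
    # Drop the leading OCR marker and surrounding whitespace.
    return ocr_part.replace(OCR_TAG, "", 1).strip(), mt_part.strip()
```

With this layout, the translation tokens are generated after, and conditioned on, the recognized source text, which is the ordering the abstract attributes to SSR. The baseline function shows the translation-only target used in plain SFT for comparison.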
