TL;DR
This paper introduces Synchronously Self-Reviewing (SSR), a novel fine-tuning method that enhances Multimodal Large Language Models' ability to perform document image machine translation by leveraging and preserving their OCR skills.
Contribution
The paper proposes SSR, a new fine-tuning paradigm that improves DIMT performance and mitigates catastrophic forgetting of OCR abilities in MLLMs.
Findings
SSR improves DIMT translation accuracy.
SSR helps preserve OCR proficiency during fine-tuning.
SSR improves generalization on both OCR and DIMT tasks.
Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on a DIMT dataset often cause the model to forget its existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR), in which the model reviews its own OCR proficiency while learning to translate; it is inspired by the concept of "Bilingual Cognitive Advantage". Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive…
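The "OCR text before translation text" ordering described in the abstract can be illustrated with a minimal sketch of how a fine-tuning target might be assembled. Everything below is an assumption made for illustration: the `DimtExample` type, the function names, and the `<ocr>` / `<translation>` tags are hypothetical and are not taken from the paper, and the abstract excerpt does not say where the OCR text comes from, so the sketch simply takes it as an input.

```python
from dataclasses import dataclass


@dataclass
class DimtExample:
    """One training example (hypothetical structure, not from the paper)."""
    ocr_text: str      # source-language text read from the document image
    translation: str   # target-language reference translation


def build_sft_target(example: DimtExample) -> str:
    """Plain SFT baseline: the model is trained to emit only the translation."""
    return example.translation


def build_ssr_target(
    example: DimtExample,
    ocr_tag: str = "<ocr>",                  # assumed delimiter
    trans_tag: str = "<translation>",        # assumed delimiter
) -> str:
    """SSR-style target: OCR text first, then the translation."""
    return f"{ocr_tag}\n{example.ocr_text}\n{trans_tag}\n{example.translation}"


def extract_translation(generated: str, trans_tag: str = "<translation>") -> str:
    """Recover the translation part from a generated SSR-style output.

    Falls back to the whole string if the tag is missing.
    """
    _, sep, tail = generated.partition(trans_tag)
    return tail.strip() if sep else generated.strip()
```

At inference time, a helper such as `extract_translation` would discard the OCR prefix so that only the translated text is scored or shown. The delimiter choice is arbitrary here; any marker that the tokenizer handles consistently would serve the same purpose.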