Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
Yufeng Zhong, Lei Chen, Zhixiong Zeng, Xuanle Zhao, Deyang Jiang, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Siqi Yang, Lin Ma

TL;DR
This paper introduces format decoupled reinforcement learning (FD-RL) to improve OCR performance on formatted documents by leveraging entropy patterns and format-specific rewards, achieving state-of-the-art results on OmniDocBench.
Contribution
The paper proposes a novel FD-RL approach that uses entropy-based filtering and format-specific rewards to enhance OCR accuracy on complex, format-sensitive documents.
Findings
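FD-RL scores 90.41 on OmniDocBench, which the authors report as a new state of the art.
Even advanced OCR models show much higher output entropy on formatted text (formulas, tables) than on plain text, often by an order of magnitude.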
Ablation studies validate the effectiveness of the proposed strategies.
Abstract
Reading text from images or scanned documents via OCR models has been a longstanding focus of researchers. Intuitively, text reading is perceived as a straightforward perceptual task, and existing work primarily focuses on enriched data engineering to strengthen supervised fine-tuning (SFT). In this work, we observe that even advanced OCR models exhibit significantly higher entropy in formatted text (e.g., formulas and tables) than in plain text, often by an order of magnitude. These statistical patterns reveal that advanced OCR models struggle with high output uncertainty when dealing with format-sensitive documents, suggesting that reasoning over diverse reading pathways may improve OCR performance. To address this, we propose format decoupled reinforcement learning (FD-RL), which leverages high-entropy patterns for targeted optimization. Our approach employs entropy-based…
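The entropy comparison described in the abstract can be illustrated with a minimal sketch. It assumes that "entropy" means the Shannon entropy of the model's next-token distribution, averaged separately over plain-text and formatted tokens. The function names, the toy logits, and the "plain"/"formatted" labels are hypothetical and are not taken from the paper.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy_by_label(step_logits, labels):
    """Average token entropy per label (hypothetical labels: 'plain', 'formatted')."""
    totals, counts = {}, {}
    for logits, label in zip(step_logits, labels):
        totals[label] = totals.get(label, 0.0) + token_entropy(logits)
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}

# Toy logits over a 4-token vocabulary (illustrative values, not real model output).
step_logits = [
    [10.0, 0.0, 0.0, 0.0],  # confident prediction
    [9.0, 0.0, 0.0, 0.0],   # confident prediction
    [1.0, 1.0, 1.0, 1.0],   # maximally uncertain prediction
    [2.0, 1.5, 1.0, 0.5],   # fairly uncertain prediction
]
labels = ["plain", "plain", "formatted", "formatted"]

stats = mean_entropy_by_label(step_logits, labels)
print(stats)
```

In this toy setup the "formatted" tokens are given flatter distributions, so their mean entropy comes out higher. With a real OCR model, the logits would come from the decoder at each generation step and the labels from the document's annotation of formulas, tables, and plain text.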
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
