FireRed-OCR Technical Report

Hao Wu; Haoran Lou; Xinyue Li; Zuodong Zhong; Zhaojun Sun; Phellon Chen; Xuanhe Zhou; Kai Zuo; Yibo Chen; Xu Tang; Yao Hu; Boxiang Zhou; Jian Wu; Yongji Wu; Wenxin Yu; Yingmiao Liu; Yuhao Huang; Manjie Xu; Gang Liu; Yidong Ma; Zhichao Sun; Changhao Qiao

arXiv:2603.01840·cs.CV·March 3, 2026

Open Access

TL;DR

FireRed-OCR is a framework that transforms general vision-language models into high-precision OCR systems for complex documents, using a novel data synthesis and progressive training approach.

Contribution

The paper introduces FireRed-OCR, a new method that converts general VLMs into specialized OCR models with a unique data factory and multi-stage training strategy.

Findings

1. Achieves 92.94% on OmniDocBench v1.5, surpassing existing models.
2. Effectively handles complex document layouts and rare types.
3. Demonstrates significant improvements in structural accuracy and robustness.

Abstract

We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, limiting their utility in industrial OCR applications. In this paper, we introduce FireRed-OCR, a novel framework designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types. Furthermore, we propose a…
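The contrast the abstract draws between random sampling and balanced curation can be illustrated with a small sketch. This is not the paper's Data Factory: the abstract does not specify its features, clustering method, or tags. Here a coarse grid over two made-up layout features stands in for geometric clustering, a single `tag` field stands in for multi-dimensional tagging, and the names `geometry_bucket` and `balanced_sample` are hypothetical.

```python
import random
from collections import defaultdict


def geometry_bucket(features, bins=4):
    """Quantize each feature (assumed scaled to [0, 1]) into a coarse grid cell.

    Assumption: a fixed grid is a simple stand-in for real clustering.
    """
    return tuple(min(int(f * bins), bins - 1) for f in features)


def balanced_sample(docs, n, bins=4, seed=0):
    """Draw up to n documents, round-robin across (geometry cell, tag) groups."""
    rng = random.Random(seed)

    # Group documents by their coarse layout cell and semantic tag.
    groups = defaultdict(list)
    for doc in docs:
        key = (geometry_bucket(doc["features"], bins), doc["tag"])
        groups[key].append(doc)

    # Shuffle within each group so the draw inside a group is random.
    pools = []
    for key in sorted(groups):
        pool = list(groups[key])
        rng.shuffle(pool)
        pools.append(pool)

    # Take one document per group per round until n are picked
    # or every group is exhausted.
    picked = []
    while len(picked) < n and any(pools):
        for pool in pools:
            if pool and len(picked) < n:
                picked.append(pool.pop())
    return picked


if __name__ == "__main__":
    # Hypothetical corpus: 90 common layouts and 10 rare ones.
    docs = [{"features": (0.1, 0.1), "tag": "invoice"} for _ in range(90)]
    docs += [{"features": (0.9, 0.9), "tag": "form"} for _ in range(10)]
    sample = balanced_sample(docs, 10)
    print(sum(d["tag"] == "form" for d in sample), "rare documents out of", len(sample))
```

With this toy corpus, a uniform random draw of 10 documents would contain about one rare layout on average, while the round-robin draw contains five. That is the sense in which grouping before sampling helps with long-tail layouts.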

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Topic Modeling