OCR-Agent: Agentic OCR with Capability and Memory Reflection

Shimin Wen; Zeyu Zhang; Xingdou Bian; Hongjie Zhu; Lulu He; Layi Shama; Daji Ergu; Ying Cai

arXiv:2602.21053·cs.CV·February 25, 2026

TL;DR

This paper introduces OCR-Agent, an iterative self-correction framework for Vision-Language Models that enhances their reasoning and accuracy through Capability and Memory Reflection, outperforming existing models on OCR benchmarks.

Contribution

The paper presents a novel self-correction framework with Capability and Memory Reflection, enabling models to diagnose errors, review past attempts, and improve answers without extra training.

Findings

1. Outperforms state-of-the-art models on OCRBench v2
2. Achieves top results in Visual Understanding and Reasoning
3. Enhances reasoning robustness through structured self-reflection

Abstract

Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous…
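
The loop described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names (`self_correct`, `capability_reflect`, `memory_reflect`, `refine`), the stopping rule, and the toy stand-ins for the model are all hypothetical. In practice each callable would wrap a prompt to a VLM, and the final refinement step is truncated in the abstract, so it is modelled here only as a generic callable.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type aliases; each callable stands in for a VLM prompt.
CapabilityReflect = Callable[[str, str], Optional[str]]       # (question, answer) -> plan, or None if no error is found
MemoryReflect = Callable[[str, List[Tuple[str, str]]], str]   # (plan, past attempts) -> plan that avoids repeats
Refine = Callable[[str, str, str], str]                       # (question, answer, plan) -> revised answer


def self_correct(
    question: str,
    first_answer: str,
    capability_reflect: CapabilityReflect,
    memory_reflect: MemoryReflect,
    refine: Refine,
    max_rounds: int = 3,  # assumed budget, not taken from the paper
) -> Tuple[str, List[Tuple[str, str]]]:
    """Run an iterative revise loop and return (final answer, attempt history)."""
    memory: List[Tuple[str, str]] = []  # (plan, resulting answer) per round
    answer = first_answer
    for _ in range(max_rounds):
        # Step 1: diagnose the current answer and draft a correction plan.
        plan = capability_reflect(question, answer)
        if plan is None:
            break  # no error diagnosed, so stop revising
        # Step 2: review past attempts so the new plan is not a repeat.
        plan = memory_reflect(plan, memory)
        # Step 3: produce a revised answer that follows the plan.
        answer = refine(question, answer, plan)
        memory.append((plan, answer))
    return answer, memory


# Toy stand-ins (purely illustrative) that mimic a model misreading "42" as "4Z".
def toy_capability(question: str, answer: str) -> Optional[str]:
    return None if answer == "42" else "recheck digits"


def toy_memory(plan: str, memory: List[Tuple[str, str]]) -> str:
    tried = {past_plan for past_plan, _ in memory}
    return plan if plan not in tried else f"{plan} (alternative {len(memory)})"


def toy_refine(question: str, answer: str, plan: str) -> str:
    return "42" if "alternative" in plan else "4Z"


final, history = self_correct(
    "What number is printed on the ticket?", "4Z",
    toy_capability, toy_memory, toy_refine,
)
```

In this toy run the first plan fails to change the answer; the memory step then rewrites the repeated plan into a different one, which is the behaviour the abstract attributes to Memory Reflection. Keeping the three steps as separate callables is a design choice of the sketch so that each can be swapped for a real model call.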

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications