OCR-Agent: Agentic OCR with Capability and Memory Reflection
Shimin Wen, Zeyu Zhang, Xingdou Bian, Hongjie Zhu, Lulu He, Layi Shama, Daji Ergu, Ying Cai

TL;DR
This paper introduces OCR-Agent, an iterative self-correction framework for Vision-Language Models that enhances their reasoning and accuracy through Capability and Memory Reflection, outperforming existing models on OCR benchmarks.
Contribution
The paper presents a novel self-correction framework with Capability and Memory Reflection, enabling models to diagnose errors, review past attempts, and improve answers without extra training.
Findings
Outperforms state-of-the-art models on OCRBench v2
Achieves top results in Visual Understanding and Reasoning
Enhances reasoning robustness through structured self-reflection
Abstract
Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous…
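The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ask_model` callable, the bracketed prompt tags, the `Attempt` record, the `max_rounds` limit, and the "OK" stopping signal are all hypothetical names and choices introduced here. The image is assumed to be bound inside `ask_model`, and because the abstract's description of the final refinement step is truncated, the sketch simply re-prompts the model with the plan.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One past attempt kept for Memory Reflection (hypothetical record)."""
    answer: str
    diagnosis: str
    plan: str


def self_correct(
    question: str,
    ask_model: Callable[[str], str],  # assumed wrapper around a VLM call
    max_rounds: int = 3,              # assumed budget, not from the paper
) -> Tuple[str, List[Attempt]]:
    memory: List[Attempt] = []
    answer = ask_model(f"[ANSWER] Question: {question}")

    for _ in range(max_rounds):
        # Capability Reflection: ask the model to diagnose its own answer.
        diagnosis = ask_model(
            f"[CAPABILITY] Question: {question}\nAnswer: {answer}\n"
            "Reply OK if the answer is correct; otherwise describe the error."
        )
        if diagnosis.strip().upper().startswith("OK"):
            break  # assumed stopping signal

        # Memory Reflection: review earlier attempts to avoid repeating them.
        history = "\n".join(
            f"- answer: {a.answer}; plan: {a.plan}" for a in memory
        ) or "(none)"
        plan = ask_model(
            f"[MEMORY] Diagnosis: {diagnosis}\nPast attempts:\n{history}\n"
            "Propose a correction plan that differs from the past attempts."
        )
        memory.append(Attempt(answer=answer, diagnosis=diagnosis, plan=plan))

        # Refinement: produce a revised answer that follows the plan.
        answer = ask_model(
            f"[REVISE] Question: {question}\nPlan: {plan}\nRevised answer:"
        )

    return answer, memory
```

Keeping the attempt history as explicit records, rather than relying on the raw conversation context, makes it straightforward to show the model exactly which answers and plans it has already tried. No weights are updated anywhere in the loop, which matches the claim that the framework needs no extra training.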
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
