DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

Wenwen Yu; Zhibo Yang; Yuliang Liu; Xiang Bai

arXiv:2508.08589·cs.CV·August 13, 2025

Open Access

TL;DR

DocThinker introduces a rule-based reinforcement learning framework for multimodal large language models, enabling dynamic, explainable reasoning processes that improve adaptability, transparency, and generalization in document understanding tasks.

Contribution

DocThinker presents an RL-based approach that replaces static reasoning templates, allowing autonomous, explainable, and adaptable inference in multimodal document analysis.

Findings

01. Significantly improves generalization across benchmarks.

02. Produces more explainable and human-understandable reasoning steps.

03. Mitigates catastrophic forgetting and enhances adaptability.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding. However, their reasoning processes remain largely black-box, making it difficult to ensure reliability and trustworthiness, especially in high-stakes domains such as legal, financial, and medical document analysis. Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) but suffer from catastrophic forgetting, poor adaptability, and limited generalization across domain tasks. In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. Instead of relying on static CoT templates, DocThinker autonomously refines reasoning strategies via policy learning, generating explainable intermediate results, including structured reasoning processes, rephrased questions, regions of interest…
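To make the idea of a rule-based reward concrete, the following is a minimal sketch of how a model output containing a reasoning process, a rephrased question, a region of interest, and a final answer could be scored with simple rules. It is not the authors' implementation: the tag names (`think`, `rephrase`, `roi`, `answer`), the three rules, the equal weighting, and the function name `rule_based_reward` are all assumptions made for illustration.

```python
import json
import re
from typing import Optional, Sequence

# Hypothetical output layout; these tag names are illustrative assumptions,
# not taken from the paper.
TAGS = ("think", "rephrase", "roi", "answer")


def extract(tag: str, text: str) -> Optional[str]:
    """Return the content of the first <tag>...</tag> pair, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rule_based_reward(output: str, gold_answer: str, gold_box: Sequence[float]) -> float:
    """Sum three assumed rule checks: format, answer match, and RoI overlap."""
    fields = {tag: extract(tag, output) for tag in TAGS}

    # Rule 1 (assumed): every expected field is present.
    format_reward = 1.0 if all(v is not None for v in fields.values()) else 0.0

    # Rule 2 (assumed): case-insensitive exact match on the final answer.
    answer = fields["answer"] or ""
    answer_reward = 1.0 if answer.lower() == gold_answer.strip().lower() else 0.0

    # Rule 3 (assumed): IoU between predicted and reference boxes;
    # a missing or malformed box scores 0.
    roi_reward = 0.0
    try:
        box = json.loads(fields["roi"] or "")
        if isinstance(box, list) and len(box) == 4:
            roi_reward = iou([float(v) for v in box], gold_box)
    except (ValueError, TypeError):
        pass

    return format_reward + answer_reward + roi_reward
```

In a policy-learning loop, a scalar like this would be computed for each sampled output and used as the training signal in place of a learned reward model. Which rules the paper actually uses, and how they are weighted, is not stated in the text above.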

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications