PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding

Kui Huang; Xinrong Chen; Wenyu Lv; Jincheng Liao; Guanzhong Wang; Yi Liu

arXiv:2506.18023·cs.CV·June 26, 2025

TL;DR

PP-DocBee2 significantly advances multimodal document understanding by improving data quality, visual feature fusion, and inference efficiency, leading to substantial performance gains and reduced latency.

Contribution

It introduces a data filtering strategy based on a statistical outlier criterion and an enhanced visual feature fusion method for multimodal document understanding models.

Findings

01. 11.4% performance improvement on Chinese business documents
02. 73.0% reduction in inference latency
03. Enhanced data quality and feature fusion strategies

Abstract

This report introduces PP-DocBee2, an advanced version of PP-DocBee, designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, an improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an $11.4\%$ performance boost on internal benchmarks for Chinese business documents and reduce inference latency by $73.0\%$ relative to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate…
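The abstract describes scoring training samples with a pre-trained model and then dropping statistical outliers, but the excerpt here does not state the actual criterion. The following is only a minimal sketch of that general idea, assuming a simple mean and standard deviation rule; the function name `filter_outliers`, the threshold `k`, and the example scores are all hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def filter_outliers(scores, k=2.0):
    """Return indices of samples whose score is within k sample standard
    deviations of the mean.

    ASSUMPTION: this z-score style rule is an illustrative stand-in, not
    the statistical criterion used in PP-DocBee2.
    """
    # With fewer than two samples a standard deviation is undefined,
    # so keep everything.
    if len(scores) < 2:
        return list(range(len(scores)))

    mu = mean(scores)
    sigma = stdev(scores)

    # All scores identical: nothing can be an outlier.
    if sigma == 0:
        return list(range(len(scores)))

    return [i for i, s in enumerate(scores) if abs(s - mu) <= k * sigma]


# Hypothetical per-sample quality scores, e.g. produced by a scoring model.
scores = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 5.0]
kept = filter_outliers(scores, k=2.0)
print(kept)
```

In this toy input the last sample sits far from the rest and is the only one removed. A real pipeline would need to choose the score and threshold to match the paper's criterion.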

Code & Models

Repositories

PaddlePaddle/PaddleMIX
paddle · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling