DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting

Md Mofijul Islam; Md Sirajus Salekin; Nivedha Balakrishnan; Vincil C. Bishop III; Niharika Jain; Spencer Romo; Bob Strahan; Boyi Xie; Diego A. Socolinsky

arXiv:2602.15958·cs.CL·February 19, 2026



TL;DR

This paper introduces DocSplit, a comprehensive benchmark dataset and evaluation framework for the challenging task of document packet splitting, addressing real-world complexities and evaluating multimodal large language models.

Contribution

The paper provides the first comprehensive benchmark dataset and novel evaluation metrics for document packet splitting, formalizes the task, and evaluates how current models perform on complex document scenarios.

Findings

01. Current models show significant performance gaps when splitting complex document packets.

02. The datasets cover diverse document types, layouts, and multimodal settings.

03. The benchmark is intended to support future research in document understanding across domains.

Abstract

Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges,…
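
The three requirements named in the abstract (boundaries, document types, page ordering) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the paper's formalization or the dataset's schema: the `PageLabel` fields, the `split_packet` helper, and the example labels are all hypothetical. It assumes a model has already assigned each packet page a document identifier, a document type, and a position within its document.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PageLabel:
    """Hypothetical per-page prediction; field names are assumptions."""
    packet_index: int  # position of the page in the stitched packet (0-based)
    doc_id: str        # which document the page belongs to (boundary decision)
    doc_type: str      # predicted document class, e.g. "invoice"
    page_in_doc: int   # intended position inside its document (1-based)


def split_packet(labels):
    """Group packet pages into documents and restore in-document page order."""
    groups = defaultdict(list)
    for label in labels:
        groups[label.doc_id].append(label)

    documents = []
    for doc_id, pages in groups.items():
        # Reorder pages by their predicted position within the document.
        pages.sort(key=lambda p: p.page_in_doc)
        documents.append({
            "doc_id": doc_id,
            # Simplification: take the type from the first page of the document.
            "doc_type": pages[0].doc_type,
            "packet_pages": [p.packet_index for p in pages],
        })

    # List documents in the order they first appear in the packet.
    documents.sort(key=lambda d: min(d["packet_pages"]))
    return documents


# A four-page packet with two interleaved documents; document B is out of order.
labels = [
    PageLabel(0, "A", "invoice", 1),
    PageLabel(1, "B", "letter", 2),
    PageLabel(2, "A", "invoice", 2),
    PageLabel(3, "B", "letter", 1),
]
example = split_packet(labels)
for doc in example:
    print(doc["doc_id"], doc["doc_type"], doc["packet_pages"])
```

Each returned entry lists packet page indices in reading order for one document, so interleaved or shuffled pages appear as non-contiguous or non-increasing index lists.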


Code & Models

Datasets

amazon/doc_split
dataset · 34k downloads


Taxonomy

Topics: Handwritten Text Recognition Techniques · Network Packet Processing and Optimization · Advanced Neural Network Applications