Beyond Document Page Classification: Design, Datasets, and Challenges
Jordy Van Landeghem, Sanket Biswas, Matthew B. Blaschko, Marie-Francine Moens

TL;DR
This paper emphasizes the importance of realistic multi-page document classification benchmarks, introduces new datasets, and discusses challenges and future directions for practical applications.
Contribution
It formalizes various multi-page document classification tasks, highlights dataset gaps, and advocates for more comprehensive evaluation methodologies.
Findings
Current benchmarks are outdated and insufficient for real-world documents.
Proposed datasets better reflect practical multi-page document scenarios.
Evaluation should include calibration, complexity, and distribution shift assessments.
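To make the calibration point concrete, here is a minimal sketch of expected calibration error (ECE), one common calibration metric. The summary above does not say which calibration measure the paper uses, so treat this as an illustrative assumption: the function name, the equal-width binning, and the default of 10 bins are choices made for the example, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (illustrative sketch, not the paper's code).

    confidences: top-class probabilities in [0, 1], one per prediction.
    correct: booleans, True where the predicted class was right.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    conf_sum = [0.0] * n_bins   # summed confidence per bin
    hit_sum = [0.0] * n_bins    # number of correct predictions per bin
    count = [0] * n_bins        # number of predictions per bin

    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        avg_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece
```

A classifier that reports 90% confidence on four documents but gets only three right has a gap of 0.15 in that bin, so its ECE is about 0.15; a classifier whose confidence matches its accuracy in every bin scores 0.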
Abstract
This paper highlights the need to bring document classification benchmarking closer to real-world applications, both in the nature of data tested (X: multi-channel, multi-paged, multi-industry; Y: class distributions and label set variety) and in classification tasks considered (f: multi-page document, page stream, and document bundle classification, ...). We identify the lack of public multi-page document classification datasets, formalize different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations. An experimental study on proposed multi-page document classification datasets demonstrates that current benchmarks have become irrelevant and need to be updated to evaluate complete documents, as they naturally occur in practice. This reality check also calls for more mature evaluation…
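The task names in the abstract can be read as differences in input and output shape. The sketch below is one hedged way to picture that, not the paper's formalization: the type aliases and the helper `split_page_stream` are hypothetical names introduced here, and the assumption that page stream classification involves cutting a stream into documents before labeling them is an interpretation of the task name.

```python
from typing import List, Sequence

# Hypothetical type aliases, for illustration only.
Page = str                 # stand-in for a page image or its extracted text
Document = List[Page]      # a multi-page document is an ordered list of pages
Label = str

# Page classification:      Page           -> Label
# Document classification:  Document       -> Label
# Page stream:              Sequence[Page] -> List[Document], each with a Label
# Document bundle:          List[Document] -> labels for the documents in it


def split_page_stream(pages: Sequence[Page],
                      starts_new_doc: Sequence[bool]) -> List[Document]:
    """Group a flat page stream into documents using per-page boundary flags.

    starts_new_doc[i] is True when page i opens a new document. The first
    page always opens a document, whatever its flag says.
    """
    if len(pages) != len(starts_new_doc):
        raise ValueError("pages and starts_new_doc must have equal length")

    documents: List[Document] = []
    for i, page in enumerate(pages):
        if i == 0 or starts_new_doc[i]:
            documents.append([page])      # open a new document
        else:
            documents[-1].append(page)    # continue the current document
    return documents
```

For example, `split_page_stream(["p1", "p2", "p3", "p4"], [True, False, True, False])` yields two two-page documents, each of which could then be passed to a document-level classifier.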
Taxonomy
Topics: Text and Document Classification Technologies · Handwritten Text Recognition Techniques · Machine Learning and Data Classification
