Evaluation of Neural Network Classification Systems on Document Stream
Joris Voerman, Aurelie Joseph, Mickael Coustaty, Vincent Poulain d'Andecy, Jean-Marc Ogier

TL;DR
This paper evaluates neural network-based document classification in realistic industrial scenarios, revealing significant performance drops for underrepresented classes and highlighting the need for adaptation.
Contribution
It analyzes the efficiency of NN-based document classification in sub-optimal, real-world conditions, comparing image and text-based approaches across various challenging scenarios.
Findings
Performance drops significantly in realistic cases
NN systems overfit well-represented classes
Underrepresented classes are poorly classified
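The gap these findings describe, where an aggregate score looks healthy while underrepresented classes are poorly classified, can be seen by reporting recall per class next to overall accuracy. The following is a minimal sketch, not the paper's evaluation code: the class names, label lists, and helper functions are hypothetical and chosen only to show the effect on a toy example.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that appears in y_true."""
    support = Counter(y_true)  # number of true documents per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct ones
    return {label: hits[label] / n for label, n in support.items()}


# Hypothetical toy labels (NOT data from the paper): one well-represented
# class and one underrepresented class.
y_true = ["invoice"] * 8 + ["contract"] * 2
y_pred = ["invoice"] * 8 + ["invoice", "contract"]

overall = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
macro_recall = sum(recalls.values()) / len(recalls)

print(f"overall accuracy: {overall:.2f}")
for label, value in recalls.items():
    print(f"recall[{label}]: {value:.2f}")
print(f"macro-averaged recall: {macro_recall:.2f}")
```

In this toy case the overall accuracy is dominated by the frequent class, while the macro-averaged recall weights every class equally and so exposes the weaker result on the rare class.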
Abstract
One major drawback of state-of-the-art Neural Network (NN)-based approaches to document classification is the large number of training samples required to obtain an efficient classification. The minimum required number is around one thousand annotated documents for each class. In many cases it is very difficult, if not impossible, to gather this number of samples in real industrial processes. In this paper, we analyse the efficiency of NN-based document classification systems in a sub-optimal training case, based on the situation of a company document stream. We evaluated three different approaches, one based on image content and two on textual content. The evaluation was divided into four parts: a reference case, to assess the performance of the system in the lab; two cases that each simulate a specific difficulty linked to document stream processing; and a realistic case…
