Natural Language Processing Models for Robust Document Categorization
Radoslaw Roszczyk, Pawel Tecza, Maciej Stodolski, Krzysztof Siwek

TL;DR
This paper evaluates machine learning models for automated text classification, comparing their accuracy and efficiency, and demonstrates a practical system for unbalanced document categorization, highlighting BiLSTM as the most balanced approach.
Contribution
It provides a comparative analysis of Naive Bayes, BiLSTM, and BERT models for document classification, and implements a functional system demonstrating real-world application.
Findings
BERT achieved over 99% accuracy but with high computational cost.
BiLSTM offered a balance between accuracy (~98.56%) and efficiency, making it suitable for practical use.
Naive Bayes was fastest but least accurate, around 94.5%.
Abstract
This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution. The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real-world automation pipelines. Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine-tuned transformer-based BERT model. The experiments reveal substantial differences in performance. BERT achieved the highest accuracy, consistently exceeding 99%, but required significantly longer training times and greater computational resources. The BiLSTM model provided a strong compromise, reaching approximately 98.56% accuracy while maintaining moderate training costs and offering robust…
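To make the simplest of the three model families concrete, here is a minimal sketch of a bag-of-words multinomial Naive Bayes classifier in plain Python. It is not the authors' implementation: the class name `MultinomialNB`, the whitespace tokenizer, the add-one smoothing, and the tiny unbalanced toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Bag-of-words multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(docs, labels):
            tokens = text.lower().split()        # naive whitespace tokenizer
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior reflects the (unbalanced) class frequencies.
            score = math.log(n_docs / self.total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, deliberately unbalanced toy corpus (3 invoices, 1 contract).
train_docs = [
    "invoice total amount due payment",
    "invoice number payment terms",
    "payment due invoice date",
    "contract parties agreement signed",
]
train_labels = ["invoice", "invoice", "invoice", "contract"]

model = MultinomialNB().fit(train_docs, train_labels)
pred_invoice = model.predict("payment due for invoice")
pred_contract = model.predict("signed agreement between parties")
```

Training here is a single counting pass over the corpus, which is consistent with Naive Bayes being the fastest of the three models. The class prior also shows where imbalance enters: a rare class starts with a lower score and needs stronger word evidence to win.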
Taxonomy
Topics: Text and Document Classification Technologies · Imbalanced Data Classification Techniques · Handwritten Text Recognition Techniques
