Natural Language Processing Models for Robust Document Categorization

Radoslaw Roszczyk; Pawel Tecza; Maciej Stodolski; Krzysztof Siwek

arXiv:2602.20336·cs.CL·February 25, 2026

TL;DR

This paper evaluates machine learning models for automated text classification, comparing their accuracy and efficiency, and demonstrates a practical system for unbalanced document categorization, highlighting BiLSTM as the most balanced approach.

Contribution

It provides a comparative analysis of Naive Bayes, BiLSTM, and BERT models for document classification, and implements a functional system demonstrating real-world application.

Findings

01. BERT achieved over 99% accuracy, but at a high computational cost.

02. BiLSTM balanced accuracy (approximately 98.56%) against efficiency, making it suitable for practical use.

03. Naive Bayes was the fastest model but the least accurate, at around 94.5%.
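
The simplest of the three compared models can be illustrated in a few lines. What follows is a minimal sketch, assuming a whitespace-tokenized bag-of-words representation and add-one (Laplace) smoothing; the preprocessing, features, and dataset actually used in the paper are not stated here, so the class name `MultinomialNaiveBayes`, the toy documents, and the labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Bag-of-words Naive Bayes with add-one (Laplace) smoothing.

    Illustrative only: not the implementation evaluated in the paper.
    """

    def fit(self, documents, labels):
        # Number of training documents per class, used for the class priors.
        self.class_doc_counts = Counter(labels)
        # Per-class word frequencies, used for the word likelihoods.
        self.word_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {
            word for counts in self.word_counts.values() for word in counts
        }
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        # Words never seen in training carry no evidence and are skipped.
        tokens = [t for t in text.lower().split() if t in self.vocabulary]
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # log P(class) + sum over tokens of log P(token | class)
            score = math.log(doc_count / self.total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical, deliberately unbalanced toy corpus (3 invoices, 2 contracts).
train_docs = [
    "invoice payment due amount total",
    "invoice total amount tax payment",
    "payment receipt invoice number",
    "contract agreement parties signature",
    "agreement terms contract clause",
]
train_labels = ["invoice", "invoice", "invoice", "contract", "contract"]

model = MultinomialNaiveBayes().fit(train_docs, train_labels)
print(model.predict("payment amount due on invoice"))
print(model.predict("signature on the contract clause"))
```

Training is a single counting pass over the corpus and prediction is a sum of logarithms, which is consistent with Naive Bayes being the fastest of the three models. The conditional-independence assumption between words is a plausible reason for its lower accuracy compared with BiLSTM and BERT, which model word order and context.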

Abstract

This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution. The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real-world automation pipelines. Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine-tuned, transformer-based BERT model. The experiments reveal substantial differences in performance. BERT achieved the highest accuracy, consistently exceeding 99%, but required significantly longer training times and greater computational resources. The BiLSTM model provided a strong compromise, reaching approximately 98.56% accuracy while maintaining moderate training costs and offering robust…

Taxonomy

Topics: Text and Document Classification Technologies · Imbalanced Data Classification Techniques · Handwritten Text Recognition Techniques