Weighted Naive Bayes Model for Semi-Structured Document Categorization
Pierre-François Marteau (VALORIA), Gilbas Ménier (VALORIA), Eugen Popovici (VALORIA)

TL;DR
This paper introduces a weighted Bayesian model for classifying semi-structured documents, demonstrating that incorporating structural context improves accuracy over traditional naive Bayes methods.
Contribution
It develops a formal recursive model integrating document structure into Bayesian classification and shows its effectiveness through experiments on textual data.
Findings
Structural context significantly improves classification accuracy.
The model competes well with SVM on Reuters-21578 data.
Even an ad hoc weighting of structural elements changes results noticeably relative to a simple multinomial naive Bayes classifier.
Abstract
This paper addresses the supervised classification of semi-structured data. A formal model based on Bayesian classification is developed that integrates document structure into the classification task. We define what we call the structural context of occurrence for unstructured data, and we derive a recursive formulation in which parameters weight the contribution of each structural element relative to the others. A simplified version of this formal model is implemented to carry out text document classification experiments. First results show, for an ad hoc weighting strategy, that the structural context of word occurrences has a significant impact on classification results compared to the performance of a simple multinomial naive Bayes classifier. The proposed implementation competes on the Reuters-21578 data with the SVM classifier associated or…
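To make the idea of weighting word occurrences by their structural context concrete, here is a minimal Python sketch. It is not the recursive model from the paper, which is not reproduced on this page. It assumes a flat document of the form {element: [words]}, and the element names ("title", "body"), the weight values, and the function names are all hypothetical. Each word occurrence contributes its element's weight, instead of 1, to an otherwise standard multinomial naive Bayes classifier with Laplace smoothing.

```python
import math
from collections import Counter, defaultdict

# Hypothetical per-element weights; these values are NOT taken from the paper.
WEIGHTS = {"title": 3.0, "body": 1.0}


def weighted_counts(doc, weights):
    """Collapse a document {element: [words]} into weighted word counts."""
    counts = Counter()
    for element, words in doc.items():
        w = weights.get(element, 1.0)  # unknown elements default to weight 1
        for word in words:
            counts[word] += w
    return counts


def train(labeled_docs, weights=WEIGHTS):
    """Collect class frequencies, per-class weighted word totals, and the vocabulary."""
    class_docs = Counter()
    word_totals = defaultdict(Counter)
    vocab = set()
    for doc, label in labeled_docs:
        class_docs[label] += 1
        wc = weighted_counts(doc, weights)
        word_totals[label].update(wc)
        vocab.update(wc)
    return class_docs, word_totals, vocab


def classify(doc, model, weights=WEIGHTS, alpha=1.0):
    """Return the class with the highest weighted log-posterior score."""
    class_docs, word_totals, vocab = model
    n_docs = sum(class_docs.values())
    wc = weighted_counts(doc, weights)
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total = sum(word_totals[label].values())
        score = math.log(class_docs[label] / n_docs)  # log prior
        for word, count in wc.items():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace-smoothed estimate of P(word | class)
            p = (word_totals[label][word] + alpha) / (total + alpha * len(vocab))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for illustration only.
TRAIN_DOCS = [
    ({"title": ["wheat", "prices"], "body": ["farm", "harvest", "report"]}, "grain"),
    ({"title": ["oil", "prices"], "body": ["crude", "barrel", "report"]}, "crude"),
]
```

Training and classifying with the same weights, for example `classify(doc, train(TRAIN_DOCS))`, emphasizes title words at both stages. Passing `{"title": 1.0, "body": 1.0}` to both `train` and `classify` reduces the sketch to a plain multinomial naive Bayes baseline, which makes the effect of the weights easy to compare on the same data.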
Taxonomy
Topics: Text and Document Classification Technologies · Advanced Text Analysis Techniques · Topic Modeling
