Naive Bayes with Correlation Factor for Text Classification Problem
Jiangning Chen, Zhibo Dai, Juntao Duan, Heinrich Matzinger, Ionel Popescu

TL;DR
This paper introduces a modified Naive Bayes classifier that incorporates a correlation factor to improve text classification accuracy, especially with small training datasets.
Contribution
It proposes a novel Naive Bayes-based method with a correlation factor to enhance performance on limited data.
Findings
Improved accuracy over traditional Naive Bayes on real-world data
Effective handling of small training datasets
Correlation factor enhances class distinction
Jiangning Chen
School of Mathematics
Georgia Institute of Technology
Atlanta, US
Zhibo Dai
School of Mathematics
Georgia Institute of Technology
Atlanta, US
Juntao Duan
School of Mathematics
Georgia Institute of Technology
Atlanta, US
Heinrich Matzinger
School of Mathematics
Georgia Institute of Technology
Atlanta, US
Ionel Popescu
School of Mathematics
Georgia Institute of Technology
Atlanta, US
Abstract
The Naive Bayes estimator is widely used in text classification problems. However, it does not perform well with small training datasets. We propose a new method based on the Naive Bayes estimator to address this problem. A correlation factor is introduced to incorporate the correlation among different classes. Experimental results show that our estimator achieves better accuracy than traditional Naive Bayes on real-world data.
Index Terms:
Naive Bayes, correlation factor, text classification, insufficient training set
I Introduction
Text classification has long been an active research field. Its aim is to develop algorithms that find the categories of given documents. Text classification has many applications in natural language processing (NLP), such as spam filtering, email routing, and sentiment analysis. Despite intensive work, it remains an open problem today.
This problem has been studied from many angles, including supervised classification, when labeled training data are given; unsupervised clustering, when only unlabeled documents are available; and feature selection.
For the supervised problem, assume that all the categories follow independent multinomial distributions and that each document is a sample generated by the distribution of its category. A straightforward idea is then to use a linear model to distinguish them, such as the support vector machine (SVM)[1, 2], which finds the "maximum-margin hyper-plane" that divides the documents with different labels. The algorithm is defined so that the distance between the hyper-plane and the nearest sample from each group is maximized. The hyper-plane can be written as the set of document vectors $x$ satisfying:
$$w \cdot x - b = 0,$$
where $w$ is the normal vector to the hyper-plane and $b$ is an offset term. Under the same assumption, another effective classifier, which uses scores based on the probability of given documents conditioned on categories, is the Naive Bayes classifier[3, 4, 5]. This classifier learns from training data to estimate the distribution of each category. We can then compute the conditional probability of each document $d$ given the class label $C_i$, apply Bayes' rule, and predict the class with the highest posterior probability. The label for a given document is obtained by:
$$\text{label}(d) = \arg\max_{i} P(C_i)\, P(d \mid C_i).$$
Given a very large data set, we can also consider deep learning models such as recurrent neural networks (RNN)[6, 7] for classification, which capture more information, such as word order and semantic representations.
For the unsupervised problem, the traditional method is SVD (singular value decomposition)[8] for dimension reduction and clustering. There also exist algorithms based on the EM algorithm, such as pLSA (probabilistic latent semantic analysis)[9], which models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions:
$$P(w, d) = \sum_{c} P(c)\, P(d \mid c)\, P(w \mid c) = P(d) \sum_{c} P(c \mid d)\, P(w \mid c),$$
where $w$ and $d$ are observed words and documents, and $c$ is the words' topic. As mentioned above, the parameters here are learned by the EM algorithm. Using the same idea, but assuming that the topic distribution has a sparse Dirichlet prior, we obtain LDA (latent Dirichlet allocation)[10]. The sparse Dirichlet priors encode the intuition that documents cover only a small set of topics and that topics use only a small set of words frequently. In practice, this results in better disambiguation of words and a more precise assignment of documents to topics.
The Naive Bayes estimator is widely used; however, it requires plenty of well-labeled data for training. To tackle this problem, this paper proposes a novel estimation method. In the remainder of this paper, we first summarize the Naive Bayes estimator in Section III. We then discuss the error of the Naive Bayes estimator in Theorem III.1 and demonstrate that it is unbiased. In Section IV, we propose a novel estimation method (see equation (11)) called Naive Bayes with correlation factor. It addresses the problem, common in real-world text classification applications, of having only limited training data available. Furthermore, in Theorem IV.1 we show that the error of the new estimator is controlled by the correlation factor and that its variance has a smaller order than that of the Naive Bayes estimator. In Section V, we show simulation results that demonstrate the performance of the method presented in Section IV. Finally, Section VI concludes our work and mentions possible future work.
II General Setting
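Consider a classification problem with the sample (document) set $S$ and the class set $C$ with $k$ different classes:
$$C = \{C_1, C_2, \dots, C_k\}.$$
Assume we have $v$ different words in total; thus each document $d \in S$ can be written as:
$$d = (x_1, x_2, \dots, x_v).$$
Define $y = (y_1, y_2, \dots, y_k)$ as our label vector. For a document $d$ in class $C_i$, we have $y_i(d) = 1$. Notice that for a single-label problem, we have $\sum_{i=1}^{k} y_i(d) = 1$.
For a test document $d$, our target is to predict:
$$\hat{y}(d) = f(d; \theta) = \big(f_1(d; \theta), f_2(d; \theta), \dots, f_k(d; \theta)\big),$$
given the training sample set $S$, where $\theta$ is the parameter matrix and $f_i(d; \theta)$ is the likelihood function of document $d$ in class $C_i$.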
III Naive Bayes classifier in text classification problem
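In this section we discuss the properties of the estimator derived from the traditional Naive Bayes method. Let class $C_i$ have centroid $\theta_i = (\theta_{i1}, \theta_{i2}, \dots, \theta_{iv})$, where $\theta_i$ satisfies $\sum_{j=1}^{v} \theta_{ij} = 1$. Assuming independence of the words, the most likely class for a document $d = (x_1, x_2, \dots, x_v)$ is computed as:
$$\text{label}(d) = \arg\max_{i} P(C_i)\, P(d \mid C_i) = \arg\max_{i} P(C_i) \prod_{j=1}^{v} \theta_{ij}^{x_j} = \arg\max_{i} \Big[ \log P(C_i) + \sum_{j=1}^{v} x_j \log \theta_{ij} \Big]. \tag{1}$$
This gives the classification criterion once $\theta$ is estimated, namely finding the largest among
$$\log P(C_i) + \sum_{j=1}^{v} x_j \log \theta_{ij}, \qquad 1 \le i \le k.$$
We now derive a maximum likelihood estimator for $\theta$. For a class $C_i$, we have the standard likelihood function:
$$L(C_i, \theta) = \prod_{d \in C_i} \prod_{j=1}^{v} \theta_{ij}^{x_j}. \tag{2}$$
Taking the logarithm of both sides, we obtain the log-likelihood function:
$$\log L(C_i, \theta) = \sum_{d \in C_i} \sum_{j=1}^{v} x_j \log \theta_{ij}. \tag{3}$$
We would like to solve the optimization problem:
$$\max_{\theta}\; L(C_i, \theta) \quad \text{subject to} \quad \sum_{j=1}^{v} \theta_{ij} = 1, \quad \theta_{ij} \ge 0. \tag{4}$$
Problem (4) can be solved explicitly with Lagrange multipliers: for class $C_i$, we have $\hat{\theta}_i = (\hat{\theta}_{i1}, \hat{\theta}_{i2}, \dots, \hat{\theta}_{iv})$, where:
$$\hat{\theta}_{ij} = \frac{\sum_{d \in C_i} x_j}{\sum_{d \in C_i} \sum_{j=1}^{v} x_j}. \tag{5}$$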
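The maximum likelihood step above can be illustrated with a short Python sketch. It assumes the estimator is the usual multinomial one (the share of each word among all word occurrences in a class) and that documents arrive as word-count lists. The function names `fit_naive_bayes` and `predict`, the toy data, and the choice to give a class a score of minus infinity when it assigns zero probability to an observed word are illustrative assumptions, not part of the paper.

```python
import math

def fit_naive_bayes(docs, labels):
    """Estimate class priors and per-class word probabilities by maximum likelihood.

    docs   : list of word-count lists, all of the same length (vocabulary size)
    labels : class index of each document
    """
    vocab_size = len(docs[0])
    prior, theta = {}, {}
    for c in sorted(set(labels)):
        totals = [0.0] * vocab_size
        for counts, y in zip(docs, labels):
            if y == c:
                for j, x in enumerate(counts):
                    totals[j] += x
        denom = sum(totals)  # all word occurrences in class c
        theta[c] = [tj / denom for tj in totals]
        prior[c] = labels.count(c) / len(labels)
    return prior, theta

def predict(counts, prior, theta):
    """Return the class with the largest log-posterior score."""
    def score(c):
        s = math.log(prior[c])
        for x, p in zip(counts, theta[c]):
            if x > 0:
                if p == 0.0:
                    return float("-inf")  # class c never produced this word
                s += x * math.log(p)
        return s
    return max(prior, key=score)

# Toy corpus: three-word vocabulary, two classes (hypothetical data).
docs = [[3, 1, 0], [2, 2, 0], [0, 1, 3], [0, 0, 4]]
labels = [0, 0, 1, 1]
prior, theta = fit_naive_bayes(docs, labels)
predicted_class = predict([2, 1, 0], prior, theta)
```

With so few documents, several word probabilities are estimated as exactly zero, which is one way the small-sample weakness discussed in this paper shows up in practice.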
For the estimator $\hat{\theta}_{ij}$, we have the following theorem.
Theorem III.1
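Assume the length of each document is normalized, that is, $\sum_{j=1}^{v} x_j = m$ for all documents $d$. Then the estimator (5) satisfies the following properties:
1. $\hat{\theta}_{ij}$ is unbiased.
2. $E\big[|\hat{\theta}_{ij} - \theta_{ij}|^2\big] = \dfrac{\theta_{ij}(1 - \theta_{ij})}{|C_i|\, m}$.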
Proof:
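1. With the assumption $\sum_{j=1}^{v} x_j = m$, we can rewrite (5) as:
$$\hat{\theta}_{ij} = \frac{\sum_{d \in C_i} x_j}{|C_i|\, m}.$$
Since $d$ follows a multinomial distribution in class $C_i$, we have $E[x_j] = m\,\theta_{ij}$, and
$$E[\hat{\theta}_{ij}] = \frac{\sum_{d \in C_i} E[x_j]}{|C_i|\, m} = \theta_{ij}.$$
Thus $\hat{\theta}_{ij}$ is unbiased.
2. By (1), we have: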
[TABLE]
Then notice
[TABLE]
where .
Since:
[TABLE]
and
[TABLE]
Plugging them into (6) obtains:
[TABLE]
thus: $E\big[|\hat{\theta}_{ij} - \theta_{ij}|^2\big] = \dfrac{\theta_{ij}(1 - \theta_{ij})}{|C_i|\, m}$.
∎
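The two properties in Theorem III.1 can be checked numerically. The sketch below is a Monte Carlo illustration, not part of the paper: it draws documents of fixed length from a multinomial distribution, applies the word-share estimator, and compares the empirical mean squared error with $\theta_{ij}(1-\theta_{ij})/(|C_i|\,m)$, the value that follows from the multinomial assumption with independent documents. The symbol `length` plays the role of the normalized document length, and all names and parameter values are illustrative assumptions.

```python
import random

def sample_document(theta, length, rng):
    """Draw one document of fixed length from a multinomial with parameter theta."""
    counts = [0] * len(theta)
    for j in rng.choices(range(len(theta)), weights=theta, k=length):
        counts[j] += 1
    return counts

def mle(docs):
    """Word-share estimator: count of each word divided by all word occurrences."""
    total = sum(sum(d) for d in docs)
    return [sum(d[j] for d in docs) / total for j in range(len(docs[0]))]

def simulate(theta, n_docs, length, trials, seed=0):
    """Return the empirical mean and mean squared error of the first component."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        docs = [sample_document(theta, length, rng) for _ in range(n_docs)]
        estimates.append(mle(docs)[0])
    mean = sum(estimates) / trials
    mse = sum((e - theta[0]) ** 2 for e in estimates) / trials
    return mean, mse

theta = [0.5, 0.3, 0.2]   # hypothetical class centroid
n_docs, length = 5, 20    # class size and normalized document length
mean, mse = simulate(theta, n_docs, length, trials=4000)
predicted = theta[0] * (1 - theta[0]) / (n_docs * length)
print(f"empirical mean {mean:.4f}, empirical MSE {mse:.5f}, predicted MSE {predicted:.5f}")
```

The empirical mean should sit close to the true parameter and the empirical error close to the predicted value, up to sampling noise.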
IV Naive Bayes with correlation factor
From Theorem III.1, we can see that the traditional Naive Bayes estimator is an unbiased estimator with variance $O\big(\frac{1}{m |C_i|}\big)$. We now look for an estimator that can perform better than the traditional Naive Bayes estimator.
Our basic idea is that, even for a single-label problem, a document usually contains words from different classes, and thus it should include features from different classes. However, the label in the training set does not reflect that information, since only one component of $y$ is 1. Thus, we would like to replace $y$ by $y + t$ in the Naive Bayes likelihood function (2), with some optimized $t$, to get our new likelihood function $L_1$:
$$L_1(C_i, \theta) = \prod_{d \in S} \prod_{j=1}^{v} \theta_{ij}^{(y_i(d) + t)\, x_j}. \tag{7}$$
Notice that to compute $L_1$ for a given class $C_i$ in our estimator, instead of using only the documents in $C_i$ as the Naive Bayes estimator does, we use every $d \in S$.
Taking the logarithm of both sides of (7), we obtain the log-likelihood function:
$$\log L_1(C_i, \theta) = \sum_{d \in S} \sum_{j=1}^{v} (y_i(d) + t)\, x_j \log \theta_{ij}. \tag{8}$$
As with the Naive Bayes estimator, we would like to solve the optimization problem:
$$\max_{\theta}\; L_1(C_i, \theta) \quad \text{subject to} \quad \sum_{j=1}^{v} \theta_{ij} = 1, \quad \theta_{ij} \ge 0. \tag{9}$$
Let:
[TABLE]
by Lagrange multiplier, we have:
[TABLE]
plug in, we obtain:
[TABLE]
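Solving (10), we obtain the solution of optimization problem (9):
$$\hat{\theta}^{L_1}_{ij} = \frac{\sum_{d \in S} (y_i(d) + t)\, x_j}{\sum_{j=1}^{v} \sum_{d \in S} (y_i(d) + t)\, x_j}, \tag{11}$$
where $y_i(d)$ is the $i$-th label component of document $d$, $x_j$ is its count of word $j$, and $t$ is the correlation factor.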
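A minimal Python sketch of this estimator follows. It rests on one reading of the construction in this section, namely that every training document contributes to every class with weight $y_i(d) + t$, so that documents outside the class enter with weight $t$. The function name `fit_correlated` and the toy data are hypothetical.

```python
def fit_correlated(docs, labels, n_classes, t):
    """Word probabilities per class, weighting each document by y_i(d) + t.

    docs      : list of word-count lists of equal length
    labels    : class index of each document (single-label setting)
    n_classes : number of classes
    t         : correlation factor; t = 0 recovers the plain word-share estimate
    """
    vocab_size = len(docs[0])
    theta = []
    for i in range(n_classes):
        weighted = [0.0] * vocab_size
        for counts, y in zip(docs, labels):
            weight = (1.0 if y == i else 0.0) + t
            for j, x in enumerate(counts):
                weighted[j] += weight * x
        total = sum(weighted)  # zero only if t = 0 and class i has no documents
        theta.append([wj / total for wj in weighted])
    return theta

# Toy corpus: three-word vocabulary, two classes (hypothetical data).
docs = [[3, 1, 0], [2, 2, 0], [0, 1, 3], [0, 0, 4]]
labels = [0, 0, 1, 1]
theta_plain = fit_correlated(docs, labels, n_classes=2, t=0.0)
theta_corr = fit_correlated(docs, labels, n_classes=2, t=0.1)
```

In this toy corpus the third word never occurs in class 0, so the plain estimate gives it probability zero, while a positive `t` borrows counts from the other class and gives it a small positive probability.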
For the estimator $\hat{\theta}^{L_1}_{ij}$, we have the following result:
Theorem IV.1
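Assume that for each class we have prior distributions $p_1, p_2, \dots, p_k$ with $p_i = |C_i| / |S|$, and that the length of each document is normalized, that is, $\sum_{j=1}^{v} x_j = m$. Then the estimator (11) satisfies the following properties:
1. $\hat{\theta}^{L_1}_{ij}$ is biased, with $\big|E[\hat{\theta}^{L_1}_{ij}] - \theta_{ij}\big| = O(t)$.
2. $E\big[|\hat{\theta}^{L_1}_{ij} - E[\hat{\theta}^{L_1}_{ij}]|^2\big] = O\big(\frac{1}{m |S|}\big)$.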
Proof:
1. With the assumption $\sum_{j=1}^{v} x_j = m$, we have:
[TABLE]
Thus:
[TABLE]
This shows that our estimator is biased. The error is controlled by $t$: when $t$ converges to 0, our estimator converges to the unbiased Naive Bayes estimator. We can also derive a lower bound for the squared error:
[TABLE]
2. For the variance part, since
[TABLE]
we have:
[TABLE]
∎
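The dependence of the bias on the correlation factor can be made explicit with a short computation. The following is a sketch, not the paper's own derivation: it assumes the estimator weights each document $d$ by $y_i(d) + t$, that every document has length $m$, and that a document in class $C_l$ has $E[x_j] = m\,\theta_{lj}$. The symbols $p_l = |C_l|/|S|$ and $\bar{\theta}_j$ are notation introduced here for illustration.

```latex
% Denominator is deterministic once document length is fixed:
%   \sum_j \sum_{d \in S} (y_i(d) + t) x_j = m (|C_i| + t |S|).
\begin{aligned}
E\big[\hat{\theta}^{L_1}_{ij}\big]
  &= \frac{|C_i|\, m\, \theta_{ij} + t \sum_{l=1}^{k} |C_l|\, m\, \theta_{lj}}
          {m \left( |C_i| + t\, |S| \right)}
   = \frac{p_i\, \theta_{ij} + t\, \bar{\theta}_j}{p_i + t},
   \qquad \bar{\theta}_j = \sum_{l=1}^{k} p_l\, \theta_{lj}, \\[4pt]
E\big[\hat{\theta}^{L_1}_{ij}\big] - \theta_{ij}
  &= \frac{t \left( \bar{\theta}_j - \theta_{ij} \right)}{p_i + t}.
\end{aligned}
```

Under these assumptions the bias vanishes as $t \to 0$ and also vanishes when the class centroid coincides with the corpus-wide average $\bar{\theta}_j$, which matches the statement that the error is controlled by the correlation factor.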
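We can see that the variance of $\hat{\theta}^{L_1}_{ij}$ is in $O\big(\frac{1}{|S|}\big)$, which means it converges faster than that of the standard Naive Bayes estimator, $O\big(\frac{1}{|C_i|}\big)$. However, since $E[\hat{\theta}^{L_1}_{ij}] \neq \theta_{ij}$, it is not an unbiased estimator.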
V Experiment
V-A Simulation with Fixed Correlation Factor
We applied our method to the top 10 topics of single-labeled documents in the Reuters-21578 data[11] and to the 20 Newsgroups data[12]. We compare the results of the traditional Naive Bayes estimator (5) and our estimator (11). In this simulation, the correlation factor $t$ is held at a single fixed value for Figure 1, Figure 2, and Figure 3.
First of all, we run both algorithms on these two sample sets. We know that when the sample size becomes large enough, our estimator converges to a biased limit rather than to the true parameter. But when the training set is small, our estimator should converge faster. Thus we first take a relatively small training size. See Figure 1. According to the simulation, our method is more accurate for most of the classes, and more accurate on average.
Then we test our estimator with a larger dataset. From our analysis above, as the dataset becomes large enough, our estimator converges to a biased limit, so we expect the traditional Naive Bayes estimator to give a better result. See Figure 2. According to the simulation, for the 20 Newsgroups data traditional Naive Bayes performs better than our method, but our method is still more accurate than Naive Bayes on the Reuters data. The reason might be that the Reuters data is highly unbalanced: a training set of 90% of the data is still not large enough for many classes.
Finally, we use the same training set with training size 10 and test the accuracy on the training set instead of the test set. We find that the traditional Naive Bayes estimator actually achieves a better result, which suggests that it is more prone to over-fitting. This might be the reason why our method works better when the dataset is not too large: adding the correlation factor introduces some uncertainty into the training process, which helps avoid over-fitting. See Figure 3.
V-B Simulation with Different Correlation Factor
In our estimator (11), we need to determine how to choose the correlation factor $t$. One idea is to choose $t$ to minimize the variance (12). Taking the derivative of (12) with respect to $t$ and setting it to 0, we find that $t$ satisfies:
[TABLE]
that is:
[TABLE]
We can see from (14) that the correlation factor $t$ should be less than 1. In our simulation, we notice that choosing the correlation factor to be around 0.1 gives the best accuracy for our estimation. See Figure 4.
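In practice, a value of the correlation factor can also be picked empirically by scanning a grid and measuring held-out accuracy, in the spirit of the simulation above. The sketch below is illustrative only: it assumes each document is weighted by $y_i(d) + t$ in the estimator, uses a uniform class prior for simplicity, and runs on a tiny hypothetical corpus. The names `fit`, `predict`, and `sweep` are not from the paper.

```python
import math

def fit(docs, labels, n_classes, t):
    """Per-class word probabilities, weighting each document by y_i(d) + t."""
    vocab_size = len(docs[0])
    theta = []
    for i in range(n_classes):
        weighted = [0.0] * vocab_size
        for counts, y in zip(docs, labels):
            weight = (1.0 if y == i else 0.0) + t
            for j, x in enumerate(counts):
                weighted[j] += weight * x
        total = sum(weighted)
        theta.append([wj / total for wj in weighted])
    return theta

def predict(counts, theta):
    """Class with the highest log-likelihood (uniform prior assumed); None if all are impossible."""
    best, best_score = None, float("-inf")
    for i, probs in enumerate(theta):
        score = 0.0
        for x, p in zip(counts, probs):
            if x > 0:
                if p <= 0.0:
                    score = float("-inf")
                    break
                score += x * math.log(p)
        if score > best_score:
            best, best_score = i, score
    return best

def sweep(train, train_labels, held, held_labels, n_classes, grid):
    """Held-out accuracy for each candidate correlation factor in grid."""
    results = {}
    for t in grid:
        theta = fit(train, train_labels, n_classes, t)
        correct = sum(predict(d, theta) == y for d, y in zip(held, held_labels))
        results[t] = correct / len(held)
    return results

# Hypothetical data: one training document per class, two held-out documents.
train, train_labels = [[3, 1, 0], [0, 1, 3]], [0, 1]
held, held_labels = [[2, 1, 1], [1, 1, 2]], [0, 1]
results = sweep(train, train_labels, held, held_labels, n_classes=2, grid=[0.0, 0.1, 0.5])
best_t = max(results, key=results.get)
```

On this toy corpus, `t = 0` leaves every class with a zero probability for some observed word, so no held-out document can be scored, while any positive `t` in the grid classifies both correctly. Real data would call for a proper validation split and a finer grid.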
VI Conclusion
In this paper, we modified the traditional Naive Bayes estimator with a correlation factor to obtain a new estimator, which is biased but has a smaller variance. We applied our estimator to text classification problems and showed that it works better when the training data set is small.
There are several important questions related to our estimator:
1. Our estimator (11) has a parameter, the correlation factor $t$. In Section V, we report simulations with a fixed $t$ and then show what happens as $t$ varies, but we have no theoretical result on how to choose $t$. One important question is how to choose $t$ in different problems, and whether $t$ can be solved for explicitly in each of them.
2. We only tested our method on the Reuters data [11] and the 20 Newsgroups data [12]. These datasets consist of news articles, which means they are highly correlated with each other. Will our estimator still work on other, more independent datasets?
3. So far our method applies only to single-labeled datasets. It would be interesting to see whether our result extends to partially labeled or multi-labeled datasets.
References
[1] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[2] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in European Conference on Machine Learning. Springer, 1998, pp. 137–142.
[3] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
[4] P. Langley, W. Iba, K. Thompson et al., "An analysis of Bayesian classifiers," in AAAI, vol. 90, 1992, pp. 223–228.
[5] J. Chen, H. Matzinger, H. Zhai, and M. Zhou, "Centroid estimation based on symmetric KL divergence for multinomial text classification problem," in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2018, pp. 1174–1177.
[6] D. Tang, B. Qin, and T. Liu, "Document modeling with gated recurrent neural network for sentiment classification," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1422–1432.
[7] P. Liu, X. Qiu, and X. Huang, "Recurrent neural network for text classification with multi-task learning," arXiv preprint arXiv:1605.05101, 2016.
[8] R. Albright, "Taming text with the SVD," SAS Institute Inc., 2004.
