TL;DR
This paper introduces microTC, a minimalistic, domain- and language-independent text classification framework that uses simple transformations and supervised learning to achieve competitive accuracy across diverse tasks and datasets.
Contribution
The paper presents microTC, a novel, easy-to-implement text classification system that outperforms or matches state-of-the-art methods on 30 diverse datasets with minimal preprocessing.
Findings
microTC achieved the best performance in 20 datasets
It obtained competitive results in 10 datasets
The approach is accessible without machine learning or NLP expertise
Abstract
A great variety of text tasks such as topic or spam identification, user profiling, and sentiment analysis can be posed as a supervised learning problem and tackled using a text classifier. A text classifier consists of several subprocesses; some of them are general enough to be applied to any supervised learning problem, whereas others are specifically designed to tackle a particular task, using complex and computationally expensive processes such as lemmatization, syntactic analysis, etc. Contrary to traditional approaches, we propose a minimalistic and wide system able to tackle text classification tasks independent of domain and language, namely microTC. It is composed of some easy-to-implement text transformations, text representations, and a supervised learning algorithm. These pieces produce a competitive classifier even in the domain of informally written text. We provide a…

| name | language | #documents total | #documents train | #documents test | #classes | performance measure |
|---|---|---|---|---|---|---|
| Topic Classification | | | | | | |
| R8 | English | 7,674 | 70% | 30% | 8 | macro-F1 |
| R10 | English | 8,008 | 70% | 30% | 10 | macro-F1 |
| R52 | English | 9,100 | 70% | 30% | 52 | macro-F1 |
| News-4 | English | 13,919 | 70% | 30% | 4 | macro-F1 |
| News-20 | English | 20,000 | 70% | 30% | 20 | macro-F1 |
| WebKB | English | 4,199 | 70% | 30% | 4 | macro-F1 |
| CADE | Portuguese | 40,983 | 70% | 30% | 12 | macro-F1 |
| Spam Identification | | | | | | |
| Ling-Spam | English | 2,893 | 10-fold | 10-fold | 2 | macro-F1 |
| PUA | English† | 1,142 | 10-fold | 10-fold | 2 | macro-F1 |
| PU1 | English† | 1,099 | 10-fold | 10-fold | 2 | macro-F1 |
| PU2 | English† | 721 | 10-fold | 10-fold | 2 | macro-F1 |
| PU3 | mixed† | 4,139 | 10-fold | 10-fold | 2 | macro-F1 |
| Author Profiling | | | | | | |
| PAN’13 Gender & Age group | English | 242,040 | 236,600 | 25,440 | 2 & 3 | accuracy |
| PAN’13 Gender & Age group | Spanish | 84,060 | 75,900 | 8,160 | 2 & 3 | accuracy |
| PAN’17‡ Gender & Language Variety | Arabic | - | 2,400 | - | 2 & 4 | accuracy |
| PAN’17‡ Gender & Language Variety | English | - | 3,600 | - | 2 & 6 | accuracy |
| PAN’17‡ Gender & Language Variety | Spanish | - | 4,200 | - | 2 & 7 | accuracy |
| PAN’17‡ Gender & Language Variety | Portuguese | - | 1,200 | - | 2 & 2 | accuracy |
| Authorship Attribution | | | | | | |
| CCA | English | 1,000 | 500 | 500 | 10 | macro-F1 |
| NFL | English | 97 | 52 | 42 | 3 | macro-F1 |
| Business | English | 175 | 85 | 90 | 6 | macro-F1 |
| Poetry | English | 200 | 145 | 55 | 6 | macro-F1 |
| Travel | English | 172 | 112 | 60 | 4 | macro-F1 |
| Cricket | English | 158 | 98 | 60 | 4 | macro-F1 |
| Multilingual Sentiment Analysis | | | | | | |
| Arabic | Arabic | 2,000 | 10-fold | 10-fold | 3 | macro-F1 |
| German | German | 91,502 | 10-fold | 10-fold | 3 | macro-F1 |
| Portuguese | Portuguese | 86,062 | 10-fold | 10-fold | 3 | macro-F1 |
| Russian | Russian | 69,100 | 10-fold | 10-fold | 3 | macro-F1 |
| Spanish | Spanish | 19,767 | 10-fold | 10-fold | 3 | macro-F1 |
| Swedish | Swedish | 49,255 | 10-fold | 10-fold | 3 | macro-F1 |

| Method | Task | English | Spanish | Avg. |
|---|---|---|---|---|
| μTC | Age | 0.6605 | 0.6897 | 0.6751 |
| μTC | Gender | 0.5867 | 0.6750 | 0.6309 |
| μTC | Joint | 0.3946 | 0.4587 | 0.4267 |
| Pastor L. | Age | 0.6572 | 0.6558 | 0.6565 |
| Pastor L. | Gender | 0.5690 | 0.6299 | 0.5995 |
| Pastor L. | Joint | 0.3813 | 0.4158 | 0.3985 |
| Santosh | Age | 0.6408 | 0.6430 | 0.6419 |
| Santosh | Gender | 0.5652 | 0.6473 | 0.6063 |
| Santosh | Joint | 0.3508 | 0.4208 | 0.3858 |
| Meina | Age | 0.6491 | 0.4930 | 0.5711 |
| Meina | Gender | 0.5921 | 0.5287 | 0.5604 |
| Meina | Joint | 0.3894 | 0.2549 | 0.3222 |

| Method | Task | Arabic | English | Spanish | Portuguese | Avg. |
|---|---|---|---|---|---|---|
| μTC | Gender | 0.7569 | 0.7938 | 0.7975 | 0.8038 | 0.7880 |
| μTC | Variety | 0.7894 | 0.8388 | 0.9364 | 0.9750 | 0.8849 |
| μTC | Joint | 0.6081 | 0.6704 | 0.7518 | 0.7850 | 0.7038 |
| Basile et al. [22] | Gender | 0.8006 | 0.8233 | 0.8321 | 0.8450 | 0.8253 |
| Basile et al. [22] | Variety | 0.8313 | 0.8988 | 0.9621 | 0.9813 | 0.9184 |
| Basile et al. [22] | Joint | 0.6831 | 0.7429 | 0.8036 | 0.8288 | 0.7646 |
| Martinc et al. [23] | Gender | 0.8031 | 0.8071 | 0.8193 | 0.8600 | 0.8224 |
| Martinc et al. [23] | Variety | 0.8288 | 0.8688 | 0.9525 | 0.9838 | 0.9085 |
| Martinc et al. [23] | Joint | 0.6825 | 0.7042 | 0.7850 | 0.8463 | 0.7545 |
| Tellez et al. [31] | Gender | 0.7838 | 0.8054 | 0.7957 | 0.8538 | 0.8097 |
| Tellez et al. [31] | Variety | 0.8275 | 0.9004 | 0.9554 | 0.9850 | 0.9171 |
| Tellez et al. [31] | Joint | 0.6713 | 0.7267 | 0.7621 | 0.8425 | 0.7507 |

| Method (macro-F1) | Reuters-8C | Reuters-10C | Reuters-52C | News-4C | News-20C | WebKB | CADE |
|---|---|---|---|---|---|---|---|
| Debole [10] | - | - | - | - | - | - | - |
| Escalante [13] | 0.9135 | 0.9184 | - | - | 0.6797 | 0.8879 | 0.4103 |
| Cummins [12, 13] | 0.8830 | 0.8759 | - | - | 0.6645 | 0.7197 | - |
| Lai CNN [14] | - | - | - | 0.9479 | - | - | - |
| Lai RNN [14] | - | - | - | 0.9649 | - | - | - |
| Hingmire [11] | - | - | - | - | - | 0.7190 | - |
| Cachopo [7] | - | - | - | - | - | - | - |
| μTC | 0.9698 | 0.9662 | 0.6746 | 0.9432 | 0.8269 | 0.9098 | 0.5687 |

| language | method | macro-F1 | accuracy |
|---|---|---|---|
| Arabic | Salameh et al. [32] | - | 0.787 |
| Arabic | Saif et al. [33] | - | 0.794 |
| Arabic | B4MSA (100%) | 0.642 | 0.799 |
| Arabic | μTC (100%) | 0.641 | 0.792 |
| German | Mozetič et al. [18] | - | 0.610 |
| German | B4MSA (89%) | 0.621 | 0.668 |
| German | μTC (89%) | 0.614 | 0.672 |
| Portuguese | Mozetič et al. [18] | - | 0.507 |
| Portuguese | B4MSA (58%) | 0.557 | 0.561 |
| Portuguese | μTC (58%) | 0.562 | 0.566 |
| Russian | Mozetič et al. [18] | - | 0.603 |
| Russian | B4MSA (69%) | 0.754 | 0.750 |
| Russian | μTC (69%) | 0.754 | 0.751 |
| Swedish | Mozetič et al. [18] | - | 0.616 |
| Swedish | B4MSA (93%) | 0.680 | 0.691 |
| Swedish | μTC (93%) | 0.679 | 0.688 |
| Spanish | B4MSA | 0.657 | 0.784 |
| Spanish | μTC | 0.649 | 0.780 |

| kind of preprocessing | actual accuracy | actual macro-F1 | pred accuracy | pred macro-F1 |
|---|---|---|---|---|
| raw | 0.8265 | 0.8199 | 0.8968 | 0.8963 |
| all-terms | 0.8340 | 0.8260 | 0.9075 | 0.9056 |
| no-short | 0.8310 | 0.8235 | 0.9052 | 0.9034 |
| no-stopwords | 0.8373 | 0.8300 | 0.9099 | 0.9082 |
| stemmed | 0.8413 | 0.8344 | 0.9071 | 0.9058 |
An Automated Text Categorization Framework based on Hyperparameter Optimization
Eric S. Tellez
CONACyT Consejo Nacional de Ciencia y Tecnología, Dirección de Cátedras, Insurgentes Sur 1582, Crédito Constructor 03940, Ciudad de México, México. INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, Circuito Tecnopolo Sur No 112, Fracc. Tecnopolo Pocitos II, Aguascalientes 20313, México.
Daniela Moctezuma
Centro de Investigación en Geografía y Geomática “Ing. Jorge L. Tamayo”, A.C. Circuito Tecnopolo Norte No. 117, Col. Tecnopolo Pocitos II, C.P. 20313, Aguascalientes, Ags, México.
Sabino Miranda-Jiménez
Mario Graff
(April 2017)
Abstract
A great variety of text tasks such as topic or spam identification, user profiling, and sentiment analysis can be posed as a supervised learning problem and tackled using a text classifier. A text classifier consists of several subprocesses; some of them are general enough to be applied to any supervised learning problem, whereas others are specifically designed to tackle a particular task, using complex and computationally expensive processes such as lemmatization, syntactic analysis, etc. Contrary to traditional approaches, we propose a minimalistic and wide system able to tackle text classification tasks independent of domain and language, namely μTC. It is composed of some easy-to-implement text transformations, text representations, and a supervised learning algorithm. These pieces produce a competitive classifier even in the domain of informally written text. We provide a detailed description of μTC along with an extensive experimental comparison with relevant state-of-the-art methods. μTC was compared on 30 different datasets. Regarding accuracy, μTC obtained the best performance on 20 datasets while achieving competitive results on the remaining 10. The compared datasets include several problems such as topic and polarity classification, spam detection, user profiling, and authorship attribution. Furthermore, it is important to state that our approach allows the usage of the technology even without knowledge of machine learning and natural language processing.
1 Introduction
Due to the large and continuously growing volume of textual data, automated text classification methods have attracted increasing interest from the research community. Although many efforts have been made in this direction, text classification remains an open problem. The arrival of massive data sources, like micro-blogging platforms, introduces new challenges where many of the prior techniques fail. Among the new challenges are the volume and noisy nature of the data, the shortness of the texts, which implies little context, and the informal style, which is plagued by misspellings and lexical errors, among others.
These new data sources have popularized tasks such as sentiment analysis and user profiling. The sentiment analysis problem consists of determining the polarity of a given text, which can be a global polarity (about the whole text) or the polarity toward a particular subject or entity. The user profiling task consists of predicting, from a given text, some facts about its author, such as demographic information (e.g., gender, age, language, or region). These problems are important enough that the research community has organized several international competitions in recent years. For example, SemEval (http://alt.qcri.org/semeval2017/), TASS (http://www.sepln.org/workshops/tass/2016/tass2016.php), and SENTIPOLC (http://www.di.unito.it/~tutreeb/sentipolc-evalita16/) are challenges for sentiment classifiers for Twitter data in English, Spanish, and Italian, respectively. PAN (http://pan.webis.de/) also opens calls for author profiling systems for English, Spanish, and German. These problems are closely related to traditional text classification applications such as topic classification (e.g., classifying a news-like text into sports, politics, or economy), authorship attribution (e.g., identifying the author of a given text), and spam detection.
Usually, each of the aforementioned problems is treated in a particular way, i.e., a method is proposed to adequately solve one classification task. Traditionally, this approach cannot generalize to other related tasks, and, consequently, the methods are dependent on the problem; however, it is worth mentioning that this specialization produces a lot of insight about the problem’s domain. Conversely, in this contribution, we propose a framework to create a text classifier regardless of both the domain and the language, based only on a training set of labeled examples.
The idea of creating a text classifier almost independent of the language and domain is not novel; in fact, in our previous work [1], we introduced a combinatorial framework for sentiment analysis. There, aspects of the language, such as stopwords and tokenizers, were considered, with special attention to lexical structures for negations. Particularities of the domain, like emoticons and emojis, were also considered. The present manuscript is a generalization and formalization of our previous work; this allows us to simplify the entire framework so that it works independently of both the language and the particular task, and to enable the use of more sophisticated text treatments whenever possible and necessary.
As stated above, we tackle the problem of creating text classifiers that work regardless of both the domain and the language, with nothing more than a training set to learn from. The general idea is to orchestrate a number of simple text transformations, tokenizers, and weighting schemes, along with a Support Vector Machine (SVM) as the classifier, to produce effective text classification. In more detail, we treat the creation of effective text classifiers as a combinatorial optimization problem: there is a search space containing all possible combinations of text transformations, tokenizers, and weighting procedures with their respective parameters, and a meta-heuristic searches this space for a configuration that produces a highly effective text classifier. This model selection procedure is commonly known in the literature as hyper-parameter optimization. To emphasize the simplicity of the approach, we named it micro Text Classification, or simply μTC.
This manuscript is organized as follows. The related work is presented in Section 2. Section 3 describes our contribution in depth. In Section 4, all the experimental details are described. In Section 5, we show an extensive experimental comparison of our approach with the relevant state-of-the-art methods over 30 different benchmarks. Finally, the conclusions are listed in Section 6.
2 Related work
Let us start by describing a typical text classifier, which can be summarized as a set of few, but complex, parts [2]. Firstly, the input text is passed to a lexical analyzer that both parses and normalizes the text and outputs a list of tokens that represent the input text. The lexical analyzer typically includes some simple transformation functions, like the removal of diacritic symbols and lower-casing of the text, but it can also make use of sophisticated techniques like stemming, lemmatization, misspelling correction, etc. The tokens are commonly represented by words, pairs or triplets of adjacent words (bigrams or trigrams), and, in general, sequences of words (word n-grams). It is also possible to extend this approach to sequences of characters (character n-grams). When the middle words of word n-grams are allowed to be dropped, we obtain skip-grams. The usage of these techniques is driven by human knowledge of the particular problem being tackled. It is also worth mentioning that the entire process is tightly linked to the input language.
Secondly, the output of the lexical analyzer is commonly used to create high-dimensional vectors where each token of the vocabulary has a corresponding coordinate in the vector, and the value of each coordinate is the weight of that token. The traditional way of weighting is to use the local and global statistics of tokens; popular examples of this approach are TF, IDF, TFIDF, and Okapi BM25. Alternatively, some information measures, like the entropy, are commonly used as term weights. It is often desirable to reduce the dimension of the vector space, and several techniques can be used for that purpose, such as PCA [3] (Principal Component Analysis) and LSI [4] (Latent Semantic Indexing).
Finally, the output of the weighting scheme is used to create a training set which can be learned by a classifier. A classifier is a machine learning algorithm that learns the instances of a training set $\mathcal{X}$. In more detail, the training set is a finite number of inputs and outputs, i.e., $\mathcal{X} = \{(x_i, y_i) \mid i = 1, \ldots, n\}$, where $x_i$ represents the $i$-th input and $y_i$ is the associated output. The objective is to find a function $f$ such that $f(x_i) = y_i$ for every $(x_i, y_i) \in \mathcal{X}$ and that can be evaluated on any element of the input space. In general, it is not possible to find a function $f$ that learns $\mathcal{X}$ perfectly. Consequently, a good classifier finds a function $f$ that minimizes an error function or maximizes a score function.
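As an illustration of the token types described above, the following Python sketch extracts word n-grams, character n-grams, and skip-grams from a text and merges them into a single multiset. The function names and the exact skip-gram convention (pairs of words separated by a fixed number of dropped words) are assumptions made for this example and are not taken from any particular implementation.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the list of word n-grams (n adjacent words joined by a space)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return the list of character n-grams, whitespace included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def skip_grams(text, skip):
    """Return word pairs separated by exactly `skip` dropped middle words."""
    words = text.split()
    step = skip + 1
    return [words[i] + " " + words[i + step] for i in range(len(words) - step)]


def bag_of_tokens(text, tokenizers):
    """Merge the output of several tokenizers into one multiset of tokens."""
    bag = Counter()
    for tokenize in tokenizers:
        bag.update(tokenize(text))
    return bag


text = "the cat sat on the mat"
bag = bag_of_tokens(text, [
    lambda s: word_ngrams(s, 1),   # unigrams
    lambda s: word_ngrams(s, 2),   # bigrams
    lambda s: skip_grams(s, 1),    # pairs with one word skipped
])
```

Because the bag is a multiset, the unigram `the` is counted twice here, which is the information a weighting scheme later consumes.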
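To make the weighting step concrete, here is a minimal sketch of TF-IDF weighting over bags of tokens, written with the Python standard library only. It assumes one common variant (term frequency normalized by document length, and IDF computed as log(N / df)); other variants exist in the literature, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(bags):
    """Map each bag of tokens to a sparse TF-IDF vector (dict: token -> weight).

    TF is the token count divided by the bag size; IDF is log(N / df),
    where df is the number of bags that contain the token.
    """
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(set(bag))          # count each token once per document
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in bag.items()
        })
    return vectors


bags = [Counter("good movie good plot".split()),
        Counter("bad movie".split())]
vectors = tfidf_vectors(bags)
```

Under this variant, a token that occurs in every document (here `movie`) receives a weight of zero, since it carries no discriminative information.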
Perhaps one of the first generic text classifiers was proposed by Rocchio [5]; it works by generating object prototypes based on centroids of a Voronoi partition over TFIDF vectors. This strategy reflects the effort to reduce the necessary memory so that it fit in the hardware available at that time. Rocchio uses nearest neighbor classification over the prototypes to perform the predictions; the preprocessing of the text was left to the expertise of the user. Rocchio was the baseline and the object of study in the area for a long time; such is the case of the work presented by Joachims [6], which describes a probabilistic analysis of the Rocchio algorithm.
With the purpose of improving the quality of the text classification task, Cardoso [7] proposes the use of centroids to enhance the power of several typical classifiers, such as kNN (k-nearest neighbors) and SVM (Support Vector Machines). Cardoso also published a number of datasets in various preprocessing stages, which are popular among the text classification community because they allow researchers to focus on the weighting and classification algorithms without having to tackle the text processing problem.
In [8], machine learning is used to create a spam detector. The proposed method uses a combination of a set of features, preprocessing steps or setup details, such as using lemmatization or not, using stop-list or not, keywords patterns, varying the length of the training corpus, etc. A similar work is presented by Androutsopoulos et al. in [9].
In the topic classification task, [10] presents an experimental scheme with the Reuters dataset, three machine learning methods (the Rocchio algorithm, k-NN, and SVM), and three term selection functions (information gain, chi-square, and gain ratio). [11] proposes a topic modelling algorithm based on Latent Dirichlet Allocation (LDA) which assigns one topic to an unlabeled document. A combination of LDA and the Expectation-Maximization (EM) algorithm is also proposed.
Another approach to text classification is to move the focus from text processing and text classification, to improve the term-weighting; this is a successful strategy followed by recent works. Cummins [12] proposes a method based on Genetic Programming to determine and evaluate several term weighting schemes for the vector space model. Escalante et al. [13] present an approach to improve the performance of classical term-weighting schemes using genetic programming. Their approach outperforms standard schemes, based on an extensive experimental comparison. The authors also compare the Cummins [12] approach over their benchmarks.
Lai et al. [14] use both recurrent and convolutional neural networks to produce a term-weighting scheme that captures semantics from the text. Similarly to word embeddings [15, 16], the authors represent words based on their context and, also, they use skip-grams for text representation. The experimental results show higher values of macro-F1 in comparison to other state-of-the-art methods.
Vilares et al. [17] introduce an unsupervised approach for multilingual sentiment analysis driven by syntax-based rules; the words are weighted based on the analysis of syntax graphs. The authors provide experimental support for English, Spanish, and German. However, to support an additional language, several rules and a proper syntax parser must be implemented.
Mozetič et al. [18] study the effect of the agreement among human taggers on the performance of sentiment classifiers. To this end, they compare several classifiers over a traditional text normalization and a vector representation with TFIDF weighting. They provide 14 tagged datasets for European languages; we selected some of them for our benchmarks. See Section 5 for more details.
Author profiling is another important task related to text categorization, where several advances have been proposed. In [19] the authors report their approach to author profiling; in particular, they describe the best classifier of the PAN’13 contest, which consists of a distributional word representation based on the membership to each class along with a number of standard text preprocessing steps, see [20]. More recently, PAN’17 [21] presented some current works related to user profiling; in this case, user profiling refers to gender and language-region classification. In [22], an SVM with a linear kernel is employed in combination with word unigrams, character 3- to 5-grams, and POS features. In [23], the selected features were word and POS n-grams, the number of emojis in the text, document sentiment, and character flooding (counting the number of times that a sequence of three or more identical characters appears in the text). Finally, a lexicon of important words is also employed.
3 μTC: A Combinatorial Framework for Text Classification
Our approach consists of finding a competitive text classifier for a given task among a (possibly large) set of candidate classifiers. A text classifier is represented by the parameters that determine the classifier’s functionality along with the input dataset. The search for the desired text classifier should be performed efficiently and accurately, in the sense that the final classifier should be competitive with the best possible classifier in the defined space of classifiers.
In the first part of this section, we describe the structure of our approach, that is, we state the parameters defining our configuration space. Then, we define the μTC graph, which is the core structure used by the meta-heuristics implemented to find a well-performing text classifier for a given task. Along the way, we also describe the score function, which encapsulates the functionality of the classifier and provides the numerical output that the search maximizes.
3.1 The configuration space
As mentioned previously, a text classifier consists of well-differentiated parts. For our purposes, a classifier has the following parts: i) a list of functions that normalize and transform the input text into the input of the tokenizers, ii) a set of tokenizer functions that transform the given text into a multiset of tokens, iii) a function that generates a vector from the multiset of tokens, and finally, iv) a classifier that knows how to assign a label to a given vector. These pieces define a μTC space of configurations, which is defined by the tuple $(\mathcal{T}, \mathcal{G}, \mathcal{H}, \Psi)$. A more detailed description is given in the following paragraphs.
1. $\mathcal{T} = \{T_1, T_2, \ldots\}$ is the space of transformation functions, where each $T_i$ is defined as the identity function together with a set of related, mutually exclusive functions (the identity function is defined as $I(x) = x$). Every function $t \in T_i$ maps a text to a text, $t(s) = s'$, where the parameter $s$ is a text, i.e., a string of symbols.
2. $\mathcal{G} = \{G_1, G_2, \ldots\}$ is the set of tokenizer functions. Each $G_i$ contains a function that returns $\emptyset$ and a simple tokenizer function, i.e., a function that extracts a list of subsequences of the given argument. More precisely, each tokenizer receives a text $s$ and extracts a list of subsequences of $s$; the outputs of all selected tokenizers are merged into a final multiset named the bag of tokens.
3. $\mathcal{H}$ is a set of functions that transform a bag of tokens into a vector of dimension $d$, i.e., each $h \in \mathcal{H}$ maps a multiset of non-empty strings to a vector in $\mathbb{R}^d$. The proper value of each vector’s coordinate is also determined by $h$; the latter task is commonly known as the weighting scheme.
4. Finally, $\Psi$ is a set of functions that create a classifier for a given labeled dataset as knowledge source.
Now, let $\mathcal{C}$ be the set of all possible configurations of the μTC space; it is defined as follows:
$$\mathcal{C} = T_1 \times \cdots \times T_{|\mathcal{T}|} \times G_1 \times \cdots \times G_{|\mathcal{G}|} \times \mathcal{H} \times \Psi,$$
then, the size of $\mathcal{C}$ is described by
$$|\mathcal{C}| = \left(\prod_{i=1}^{|\mathcal{T}|} |T_i|\right) \left(\prod_{i=1}^{|\mathcal{G}|} |G_i|\right) |\mathcal{H}| \, |\Psi|.$$
Without loss of generality, the size of the search space can be summarized as $\kappa \, 2^{|\mathcal{T}| + |\mathcal{G}|} \, |\mathcal{H}| \, |\Psi|$, where the term $\kappa \geq 1$ captures the effect of $T_i$s with more than two member functions. This means that $|\mathcal{C}|$ is lower bounded by $2^{|\mathcal{T}| + |\mathcal{G}|}$, which is reached when all $T_i$s are binary and both $\mathcal{H}$ and $\Psi$ are singletons. Even in the simplest setup, the configuration space grows exponentially with the number of possible transformations and tokenizers. Thus, finding the best item requires evaluating the entire space, which is computationally infeasible (for instance, evaluating each configuration takes about 10 minutes on a commodity workstation; more about this is detailed in the experimental section). A typical configuration space can contain billions of configurations, so an exhaustive evaluation is not feasible on current computers. To remain practical, instead of performing an exhaustive evaluation of $\mathcal{C}$ to find the best configuration, we relax the problem to finding a (very) competitive configuration; it can then be solved as a combinatorial optimization problem, in particular, as the maximization of a score function.
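The growth of the configuration space can be checked with a few lines of Python. The sketch below computes the size of a toy space as the product of the sizes of its function sets and compares it with the all-binary lower bound. The set sizes mirror the preprocessing function sets listed later in Section 4.1 (four three-way handlers and four binary ones); the number of tokenizers and the singleton weighting and classifier sets are hypothetical values chosen only for illustration.

```python
from math import prod

# Sizes of the transformation function sets: four three-way handlers
# (remove / group / identity) and four binary ones (apply / identity).
transformation_sets = {
    "hashtag": 3, "number": 3, "url": 3, "user": 3,
    "diacritic": 2, "duplication": 2, "punctuation": 2, "lower-case": 2,
}
n_tokenizers = 10    # hypothetical: each tokenizer is switched on or off
n_weightings = 1     # hypothetical: a single weighting scheme
n_classifiers = 1    # hypothetical: a single classifier


def space_size(transformation_sets, n_tokenizers, n_weightings, n_classifiers):
    """Number of configurations: product of the sizes of all function sets."""
    return (prod(transformation_sets.values())
            * 2 ** n_tokenizers * n_weightings * n_classifiers)


size = space_size(transformation_sets, n_tokenizers, n_weightings, n_classifiers)
lower_bound = 2 ** (len(transformation_sets) + n_tokenizers)

# Exhaustive evaluation time at about 10 minutes per configuration, in years.
years = size * 10 / (60 * 24 * 365)
```

Even this small space holds more than a million configurations, so an exhaustive search at roughly ten minutes per configuration would take decades on a single machine.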
3.2 The configuration graph
In order to solve the combinatorial problem with local search-based meta-heuristics, it is necessary to create a graph where the vertex set corresponds to $\mathcal{C}$, the set of configurations, and the edge set corresponds to the neighborhood of each vertex, $\{(c, u) \mid u \in N(c)\}$. The edges are simply denoted by the neighborhood function $N$, so $(\mathcal{C}, N)$ is a μTC graph.
Our main assumption is simple and feasible: the score function varies slowly across similar configurations, so we can assume some degree of local concavity, in the sense that a well-performing local maximum can be reached using greedy decisions at some given point. Even though this is not true in general, the solver algorithm should be robust enough to get a good approximation even when the assumption holds only to some degree. To induce these search properties, the neighborhood should be defined in such a way that it contains only similar configurations. For this purpose, we define a distance function between configurations. First, we define a comparison function,
$$\delta(a, b) = \begin{cases} 0 & \text{if } a = b, \\ 1 & \text{otherwise.} \end{cases}$$
Since each configuration is a tuple of functions, the Hamming distance over configurations is naturally defined as follows
$$d_H(u, v) = \sum_{i} \delta(u_i, v_i).$$
Now, we can define $N_r(c) = \{u \in \mathcal{C} \mid d_H(c, u) \leq r\}$, for any radius $r$ and configuration $c$. However, the number of items grows exponentially with the radius, and therefore the notion of locality rapidly degrades. To maintain locality, we define the neighborhood as:
$$N(c) = \{u \in \mathcal{C} \mid d_H(c, u) = 1\}.$$
Under this construction scheme, the diameter of $(\mathcal{C}, N)$ is determined by the length of the configuration tuple, i.e., $|\mathcal{T}| + |\mathcal{G}| + 2$; the diameter determines the number of hops in the μTC graph that an optimal opt algorithm will perform in the worst case. However, since the best configuration is unknown, we must use score as an oracle that guides our navigation at each step.
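The distance and neighborhood notions of this section can be sketched in Python as follows. Configurations are represented as tuples with one chosen option per position, and the neighborhood is assumed to contain the configurations at Hamming distance exactly one, which is consistent with the diameter being the length of the configuration tuple. The option names and the `domains` structure are hypothetical.

```python
def hamming(u, v):
    """Number of positions in which two configurations differ."""
    if len(u) != len(v):
        raise ValueError("configurations must have the same length")
    return sum(1 for a, b in zip(u, v) if a != b)


def neighborhood(config, domains):
    """All configurations at Hamming distance exactly 1 from `config`.

    `domains[i]` lists the options available at position i.
    """
    result = []
    for i, domain in enumerate(domains):
        for value in domain:
            if value != config[i]:
                result.append(config[:i] + (value,) + config[i + 1:])
    return result


# Hypothetical space: one three-way handler and two binary switches.
domains = [("remove", "group", "identity"), (True, False), (True, False)]
config = ("identity", True, False)
neighbors = neighborhood(config, domains)

# Two configurations that differ everywhere are `len(domains)` hops apart.
farthest = hamming(config, ("remove", False, True))
```

The neighborhood size is the sum of the alternatives at each position (here 2 + 1 + 1), so it grows linearly with the tuple length, whereas the whole space grows exponentially.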
3.3 The score function
The score function evaluates the performance of the text classifier defined by a configuration $c \in \mathcal{C}$ with the given training and test sets. Without loss of generality, the evaluation of a configuration can be described by three main steps:
1. The dataset $\mathcal{X}$ is divided into $\mathcal{X}_{\text{train}}$ and $\mathcal{X}_{\text{test}}$.
2. The μTC algorithm described by $c$ learns from $\mathcal{X}_{\text{train}}$.
3. The prediction performance of $c$ is evaluated using the dataset $\mathcal{X}_{\text{test}}$; more details are given below.
These steps can be modified to support cross-validation schemes, like $k$-folds or bagging, which provide a more robust way to measure the performance of a classifier. The details of these measurement strategies are beyond the scope of this manuscript; the interested reader is referred to Ch. 9 of [24].
Now, recall from §3.1 that $c$ contains the parameters for a number of functions that transform the input text into its associated label. Given a configuration $c$, a classifier $\psi$ is created using the labeled dataset $\mathcal{X}_{\text{train}}$ by transforming all texts in the training set into their corresponding vector form, i.e., $\mathbf{v}_i = h(g(t(x_i)))$ for each $(x_i, y_i) \in \mathcal{X}_{\text{train}}$, where $t$, $g$, and $h$ denote the transformations, tokenizers, and weighting function selected by $c$. Once the classifier is trained, the associated label for every text $x$ in $\mathcal{X}_{\text{test}}$ is computed as $\hat{y} = \psi(h(g(t(x))))$. Finally, the performance of $c$ is computed by comparing the predicted labels against the actual ones; a typical score function will use F1 (macro or micro), accuracy, precision, or recall to measure the quality of the text classifier.
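The three evaluation steps can be sketched as a small Python function. The example below is only illustrative: `build_classifier` is a hypothetical toy stand-in for the full pipeline (it memorizes words that appear only in positive training texts), the configuration holds a single made-up `lower` switch, and macro-F1 is computed from scratch rather than with a library.

```python
def macro_f1(actual, predicted):
    """Unweighted mean of the per-class F1 scores."""
    labels = set(actual) | set(predicted)
    total = 0.0
    for label in labels:
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / len(labels)


def build_classifier(config, train):
    """Toy pipeline: flag a text as 'pos' if it has a word seen only in 'pos'."""
    prep = str.lower if config["lower"] else str   # str() leaves text unchanged
    pos_words, neg_words = set(), set()
    for text, label in train:
        target = pos_words if label == "pos" else neg_words
        target.update(prep(text).split())
    cue_words = pos_words - neg_words
    return lambda text: "pos" if cue_words & set(prep(text).split()) else "neg"


def score(config, train, test, metric=macro_f1):
    """Train on `train` with `config`, predict `test`, and return the metric."""
    classify = build_classifier(config, train)
    actual = [label for _, label in test]
    predicted = [classify(text) for text, _ in test]
    return metric(actual, predicted)


train = [("good film", "pos"), ("bad film", "neg")]
test = [("Good plot", "pos"), ("great cast", "pos"),
        ("bad plot", "neg"), ("dull", "neg")]
with_lower = score({"lower": True}, train, test)
without_lower = score({"lower": False}, train, test)
```

The two configurations yield different scores on the same data because lower-casing lets the capitalized test word match the training vocabulary; this kind of difference is what the search exploits.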
3.4 Optimization process
The core idea for solving the optimization problem is to navigate the configuration graph using a combination of two meta-heuristics. In the following paragraphs, we briefly review the techniques we used to solve the combinatorial problem. A proper survey of the area is beyond the scope of this manuscript; the interested reader is referred to [25, 26].
To keep μTC within practical computational requirements, we select two types of fast meta-heuristics: Random Search [27] and Hill Climbing [25, 26]. The former consists of selecting the best-performing configuration among a set $S$ randomly chosen from $\mathcal{C}$, that is,
$$\arg\max_{c \in S} \text{score}(c),$$
where the size of $S$ is an open parameter dependent on the task. On the other hand, the core idea behind Hill Climbing is to explore the neighborhood $N(c)$ of an initial configuration $c$ and then greedily update $c$ to be the best-performing configuration in $N(c)$. The process is repeated until no improvement is possible, that is, until
$$\max_{u \in N(c)} \text{score}(u) \leq \text{score}(c).$$
We improve the whole optimization process by applying a Hill Climbing procedure over the best configuration found by a Random Search. We also add memory to avoid evaluating a configuration twice. In principle, this is similar to Tabu search; however, our implementation is simpler than a typical implementation of Tabu search.
Summarizing, the optimization process is driven by the tuple $(\mathcal{C}, \mathcal{X}, \text{score}, \text{opt})$, where i) $\mathcal{C}$ is the μTC space, ii) $\mathcal{X}$ is the training set of labeled texts, iii) score is the function to be maximized, and finally, iv) opt is a combinatorial optimization algorithm that uses $\mathcal{X}$ and score to find an almost optimal configuration in $\mathcal{C}$.
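The combination of Random Search followed by Hill Climbing, with memory, can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the framework's implementation: the neighborhood is taken to be the set of configurations that differ in exactly one position, memory is provided by `functools.lru_cache`, and `toy_score` is a hypothetical stand-in for the expensive train-and-evaluate step.

```python
import random
from functools import lru_cache


def neighborhood(config, domains):
    """Configurations that differ from `config` in exactly one position."""
    return [config[:i] + (value,) + config[i + 1:]
            for i, domain in enumerate(domains)
            for value in domain if value != config[i]]


def random_search(domains, score, n_samples, rng):
    """Return the best of `n_samples` configurations drawn at random."""
    sample = [tuple(rng.choice(domain) for domain in domains)
              for _ in range(n_samples)]
    return max(sample, key=score)


def hill_climbing(start, domains, score):
    """Greedily move to the best neighbor until no neighbor improves."""
    current = start
    while True:
        best = max(neighborhood(current, domains), key=score)
        if score(best) <= score(current):
            return current
        current = best


# Hypothetical space: three three-way handlers and two binary switches.
DOMAINS = [("remove", "group", "identity")] * 3 + [(True, False)] * 2
TARGET = ("group", "identity", "remove", True, False)


@lru_cache(maxsize=None)          # memory: each configuration is scored once
def toy_score(config):
    """Stand-in for the real score: positions that agree with TARGET."""
    return sum(a == b for a, b in zip(config, TARGET))


rng = random.Random(0)
seed_config = random_search(DOMAINS, toy_score, n_samples=8, rng=rng)
best_config = hill_climbing(seed_config, DOMAINS, toy_score)
```

With this toy score every non-optimal configuration has a strictly better neighbor, so the climb always ends at `TARGET`; a real score function offers no such guarantee, which is why the search only aims at a competitive configuration.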
4 Experimental setup
This section describes the general setup used to characterize and compare our method with the related state of the art. In particular, we define the set of functions used to create our μTC space, and we detail the benchmarks used in the comparison.
All the experiments were run on an Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz with 32 threads and 192 GiB of RAM running CentOS 7.1 Linux. We implemented μTC in Python (available under the Apache 2 license at https://github.com/INGEOTEC/microTC). To characterize the performance of μTC and compare it to the relevant state of the art, we selected a number of popular benchmarks in the literature; these datasets are described below. It is worth mentioning that we bias our selection toward benchmarks coming from popular international challenges. With the purpose of avoiding over-fitting, we performed the model selection using score as a $k$-fold cross-validation of the specified performance measure, see Table 1. We decided to use cross-validation for this stage because we observed over-fitting for small datasets, like those found in authorship attribution, when we used a static train-test partition to perform model selection. A brief experimental study of the effect of the validation schemes is presented in §5.2.
4.1 About our particular μTC space
As stated before, μTC is a framework to create text classifiers by searching for the best models in a configuration space. This space can be adjusted for any particular problem, but here we consider a space general enough to cover a disparate set of benchmarks (listed below in §4.3).
When the knowledge about the domain is low, a large and generic configuration space should be used. It could be tempting to learn about the domain using the information found by the optimization process; this is clearly possible. However, it should be taken into account that the search process makes decisions to match the particular dataset, not the domain, and any generalization of the knowledge must be curated by an expert in the domain. It is important to mention that large configuration spaces consume a lot of computational time to be optimized.
On the other hand, a hand-crafted configuration space for a given problem can yield very fast processing times; however, a vast knowledge of the domain is required to reach this state. In this case, we give up both the possibility of discovering new knowledge about the domain and the ability to exploit the particularities of the dataset that a more general configuration space can provide.
To handle the disparate list of benchmarks, we select a generic, large configuration space defined in the following paragraphs.
Preprocessing functions
We associate the preprocessing stage with the following function sets.
hashtag-handlers.
This function set contains three functions: one removes all hashtags, one groups them into a single tag, and one leaves the text unmodified. The hashtag format is the one introduced by Twitter, now popular across many data sources.
number-handlers.
This function set contains functions to remove, group, or leave untouched the numbers in the text.
url-handlers.
This function set contains functions to remove, group, or leave untouched the URLs in the text.
usr-handlers.
This function set contains functions to remove, group, or leave untouched users and host domains in the text. The pattern being tackled is @user, a popular way to denote users in several social networks; the pattern also naturally matches the domain part of email addresses.
diacritic-removal.
This function set contains functions to remove, or leave untouched, diacritic symbols in the text. The objective is to reduce composed symbols like á, ä, ã, â, or à to simply a. Diacritics are a well-known source of errors in informal text written in languages that make heavy use of them.
duplication-removal.
This function set contains functions to remove, or leave untouched, duplicated contiguous symbols in the text.
punctuation-removal.
This function set contains functions to remove, or leave untouched, punctuation symbols in the text. The list of punctuation symbols includes ; : . - ’ " ( ) [ ] { } < > ? ! among others.
lower-casing.
This function set contains functions to normalize the case of the text (lower-casing) or leave it untouched.
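The function sets above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the microTC implementation: the option names (`remove`, `group`, `none`), the regular expressions, the replacement tags (`_htag`, `_num`, `_url`, `_usr`), the configuration keys, and the order in which the functions are applied are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical option names; the paper's own symbols are not reproduced here.
REMOVE, GROUP, NONE = "remove", "group", "none"

# Illustrative patterns and tags (assumptions, not the ones used by microTC).
PATTERNS = {
    "hashtag": (re.compile(r"#\w+"), "_htag"),
    "number": (re.compile(r"\d+(?:[.,]\d+)*"), "_num"),
    "url": (re.compile(r"https?://\S+"), "_url"),
    "usr": (re.compile(r"@\w+"), "_usr"),
}


def apply_handler(text, kind, option):
    """Remove, group into a single tag, or leave untouched one kind of entity."""
    pattern, tag = PATTERNS[kind]
    if option == REMOVE:
        return pattern.sub("", text)
    if option == GROUP:
        return pattern.sub(tag, text)
    return text


def remove_diacritics(text):
    """Reduce composed symbols such as á, ä, ã, â, à to their base letter."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def remove_duplicates(text):
    """Collapse runs of the same contiguous symbol into one occurrence."""
    return re.sub(r"(.)\1+", r"\1", text)


def remove_punctuation(text):
    """Drop every symbol that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def normalize(text, config):
    """Apply one option per function set, as selected by `config`."""
    # URLs go first so that '#' or '@' inside them are not rewritten.
    for kind in ("url", "usr", "hashtag", "number"):
        text = apply_handler(text, kind, config.get(kind, NONE))
    if config.get("diacritics", False):
        text = remove_diacritics(text)
    if config.get("duplicates", False):
        text = remove_duplicates(text)
    if config.get("punctuation", False):
        text = remove_punctuation(text)
    if config.get("lowercase", False):
        text = text.lower()
    return " ".join(text.split())
```

A configuration is then a plain dictionary with one entry per function set; an empty dictionary leaves the text untouched apart from whitespace normalization.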
The list of tokenizers
After all text normalization and transformation, a list of tokens should be extracted. We use three schemes for our tokenizers.
Word n-grams.
This family of tokenizers first splits the text into words and then produces tokens of $n$ consecutive words, i.e., word $n$-grams. An $n$-gram is a string of $n$ consecutive words. For example, “The red car is in front of the tree” creates the following 3-grams: The red car, red car is, car is in, is in front, in front of, front of the, of the tree.
Character n-grams.
This family of tokenizers does not assume anything about the text and splits the input into all $n$-sized substrings, i.e., $m - n + 1$ substrings of $n$ characters for a text of $m$ characters. For example, the character 4-grams of “I like the red car” are I_li, _lik, like, ike_, ke_t, e_th, _the, the_, he_r, e_re, _red, red_, ed_c, d_ca, _car. We use the symbol _ to show the space character.
Skip-grams.
Skip-grams are similar to word n-grams, but they allow skipping the middle parts. For example, the skip-grams of the previous example (two words, skipping one in the middle) are I-the, like-red, the-car. The idea behind this family of tokenizers is to capture the occurrence of related words that are separated by some unrelated words.
Instead of selecting one tokenizer scheme or another, we allow any subset of the available tokenizers to be selected and take the union of the resulting multisets of tokens. For instance, our configuration space considers three word n-gram tokenizers, nine character n-gram tokenizers, and three skip-gram tokenizers.
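The three tokenizer families can be illustrated with a short, self-contained Python sketch. The function names and signatures are hypothetical, and merging the bags by adding counts is only one possible reading of the "union of multisets"; the example sentences are the ones used in the text, with spaces shown as `_` in the character n-grams.

```python
from collections import Counter


def word_ngrams(text, n):
    """Tokens made of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n, space="_"):
    """All n-sized substrings of the text; spaces are shown as '_'."""
    chars = text.replace(" ", space)
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


def skip_grams(text, n, skip):
    """n words taken every (skip + 1) positions, joined with '-'."""
    words = text.split()
    span = (n - 1) * (skip + 1) + 1  # positions covered by one token
    return ["-".join(words[i:i + span:skip + 1])
            for i in range(len(words) - span + 1)]


def tokenize(text, tokenizers):
    """Merge the bags produced by the selected tokenizers (counts are added)."""
    bag = Counter()
    for tokenizer in tokenizers:
        bag.update(tokenizer(text))
    return bag


if __name__ == "__main__":
    print(word_ngrams("The red car is in front of the tree", 3))
    print(char_ngrams("I like the red car", 4))
    print(skip_grams("I like the red car", 2, 1))
```

Selecting a subset of tokenizers then amounts to passing a list of callables to `tokenize`, for example `[lambda t: word_ngrams(t, 1), lambda t: char_ngrams(t, 4)]`.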
Weighting schemes
After obtaining a multiset (bag of tokens) from the tokenizers, we must create a vector space. We selected a small set of frequency filters and the TFIDF scheme to weight the coordinates of the vector. First, we apply a sequence of two filters, max-filter and min-filter; then, we choose either the term frequency (TF) or TFIDF as the weight. For the max-filter, we delete all tokens surpassing a frequency threshold expressed in terms of max-freq, where max-freq is the maximum frequency of a token in the collection; we consider four values for this threshold. For the min-filter, we delete all tokens not reaching a given frequency threshold. Notice that each filter includes a setting that does not perform any filtering. In total, we have embedded 32 different configurations for weighting.
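A minimal sketch of this weighting stage, using only the Python standard library, is given next. The parameter names `max_ratio` and `min_count`, their default values, and the particular TF and IDF normalizations are assumptions chosen for illustration; the thresholds actually used in the configuration space are not reproduced here. With the defaults, neither filter removes anything.

```python
import math
from collections import Counter


def filter_tokens(docs, max_ratio=1.0, min_count=1):
    """Drop tokens whose collection frequency is above max_ratio * max-freq
    or below min_count. `docs` is a non-empty list of token bags (Counters)."""
    freq = Counter()
    for bag in docs:
        freq.update(bag)
    max_freq = max(freq.values())
    keep = {t for t, f in freq.items() if min_count <= f <= max_ratio * max_freq}
    return [Counter({t: c for t, c in bag.items() if t in keep}) for bag in docs]


def weight(docs, use_idf=True):
    """Return one sparse vector (dict) per bag, weighted by TF or TFIDF."""
    n_docs = len(docs)
    # Iterating a Counter yields its distinct tokens: document frequency.
    doc_freq = Counter(t for bag in docs for t in bag)
    vectors = []
    for bag in docs:
        total = sum(bag.values())
        vec = {}
        for token, count in bag.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[token]) if use_idf else 1.0
            vec[token] = tf * idf
        vectors.append(vec)
    return vectors
```

Under this plain IDF, a token that appears in every document gets weight zero, which is one reason the frequency filters and the TF-only option can lead to different models.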
Classifier
We decided to use a singleton set populated with an SVM with a linear kernel. It is well known that SVMs perform excellently on very high-dimensional input (which is our case), and the linear kernel also performs well under these conditions. We do not optimize the parameters of the classifier, since we are mainly interested in the rest of the process. We use the SVM classifier from liblinear, Fan et al. [28].
On the final configuration space
The task of finding the best model in the space of configurations is hard. The number of possible preprocessing configurations is $3^4 \times 2^4 = 1296$ (i.e., four trivalent function sets and four bivalent function sets). From the above configuration, the number of possible tokenizer configurations is 81; also, we have 32 different weighting combinations. So, the configuration space contains more than 3.3 million configurations. For instance, evaluating a single configuration on a sentiment analysis benchmark with ten thousand tweets takes close to ten minutes. Therefore, an exhaustive evaluation of the configuration space would need up to 64 years. Even on a large distributed cluster, the process would need too much time to complete, and such computing power is not easily accessible. Nonetheless, if we relax the problem to finding not the best model but an excellent one, we can use an algorithm for combinatorial optimization, as explained in §3.
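The size of the configuration space and the cost of an exhaustive search follow directly from the figures stated above; the short Python check below reproduces the arithmetic. The variable names are ours, and the ten-minute cost per configuration is the approximate value quoted in the text.

```python
# Sizes stated in the text.
preprocessing = 3 ** 4 * 2 ** 4   # four trivalent and four bivalent function sets
tokenizers = 81
weightings = 32

configurations = preprocessing * tokenizers * weightings

minutes_per_configuration = 10    # "close to ten minutes" per evaluation
total_minutes = configurations * minutes_per_configuration
years = total_minutes / (60 * 24 * 365)

print(preprocessing)    # 1296
print(configurations)   # 3359232
print(round(years, 1))  # 63.9
```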
4.2 On the preparation of the input text
Since μTC includes the preprocessing step among its parts, we tried to collect all datasets as raw text, without any kind of preprocessing transformation. This was not possible in the general case, mostly due to the age of the datasets; we consider the following text preparation states, in the style of Cachopo [7]:
- the raw text corresponds to the original, non-formatted text;
- the all-terms dataset converts all text into lowercase; also, all diacritic symbols and punctuation marks are removed, and all spacing symbols are normalized to a single space;
- the no-short dataset removes all terms having three or fewer characters;
- the no-stopwords dataset also removes all non-discriminant words for English (adjectives, adverbs, conjunctions, articles, etc.);
- finally, after the previous steps, all words are transformed by Porter’s stemmer for English [29] to generate the stemmed dataset.
For instance, we use the all-terms version for R8, R10, R52 and WebKB; for CADE we use the stemmed version. In these cases, we used the datasets prepared by Cachopo [7]. In the other cases, we use the raw text. The effect of using one state or another is studied in §5.1.
4.3 Benchmark description
The text classification literature has a myriad of datasets, performance measures, and validation schemes. We select several prominent and popular benchmark configurations in the literature; in particular, we work with topic classification, spam identification, author profiling, authorship attribution, and sentiment analysis. To avoid implementation mistakes, we directly use the performances reported in the literature; consequently, we are restricted to comparing under the same circumstances. Table 1 describes the language and number of classes of each dataset; it also describes the kind of validation; in particular, we consider two validation schemes: i) 10-fold cross-validation, and ii) a static train-test partition of the specified sizes. The diversity of benchmarks and validation schemes helps us to demonstrate the functionality of our approach in many circumstances.
The Reuters-21578 collection (http://www.daviddlewis.com/resources/testcollections/reuters21578/) is one of the most used collections for text categorization research. The documents were manually labeled by personnel from Reuters Ltd. The 20Newsgroup dataset (http://people.csail.mit.edu/jrennie/20Newsgroups) is very popular in the text classification area; it contains news related to different topics, originally collected by Ken Lang. The WebKB dataset (http://www.cs.cmu.edu/~webkb/) contains university webpages classified into seven different categories: student, faculty, staff, department, course, project and other. We use the four most popular classes in our experiments. The CADE dataset [7] is another collection of webpages, specifically Brazilian webpages classified by human experts. This collection contains a total of 12 classes, e.g., services, sports, science, education, and news, among others.
The PU collection [9] consists of emails written in English and other languages, classified as spam and non-spam messages; it contains the following datasets: PUA, PU1, PU2 and PU3. The Ling-Spam dataset [30] is also a spam dataset.
The PAN contest [20, 21] has several tasks, among them author identification and author profiling. The author profiling task is a forensic linguistics problem that consists of detecting the gender and age of the author (PAN’13). For PAN’17, the age identification task was replaced by the task of determining the language variety of the writer; also, the number of different languages was increased to four. As listed in Table 1, the official dataset is undisclosed, and each algorithm must be evaluated with the TIRA evaluation platform (https://tira.io). The Authorship Attribution datasets [13] cover different types of topics: CCA, NFL, Business, Poetry, Travel and Cricket. The objective of these datasets is to determine the authorship of each document.
The Multilingual Sentiment Analysis datasets are sets of tweets in different languages: Arabic, German, Portuguese, Russian, Swedish and Spanish. The purpose of these datasets is to classify each tweet as having negative, neutral, or positive polarity.
A detailed description of all these datasets is provided in Table 1, which lists particularities of each dataset such as the written language, the number of documents, the kind of evaluation (independent train-test sets or $k$-folds), the number of classes, and the performance measure optimized by μTC.
5 Experimental Results
This section is dedicated to comparing our work with the relevant state-of-the-art methods described above. Also, we characterize the generalization power in terms of the validation scheme.
The first task analyzed is authorship attribution; Table 2 shows the macro-F1 and accuracy performances for a set of authorship attribution benchmarks. Here, we compare μTC with two term-weighting schemes, [13] and [12]. The pre-processing stage of μTC’s input is all-terms; the others use the stemmed stage. The best performing classifiers are created by μTC, except for NFL, where the alternatives perform better. In the case of Business, Escalante et al. [13] perform slightly better only in terms of accuracy. Please notice that NFL and Business are among the smallest datasets we tested; the low performance of μTC may be caused by the low number of exemplars, while the alternative schemes take advantage of the few samples to compute better weights.
Table 3 presents the results of the PAN’13 competition. According to the contest report [20], the best results were achieved by Pastor, Santosh, and Meina. In this benchmark, μTC produces the best result in all average cases. In a fine-grained comparison, only Meina surpasses μTC on gender identification for English.
Table 4 shows the performance of μTC in the PAN’17 benchmark. The table also lists the best three results of the challenge, reported as statistically equivalent in [21]; these works are detailed in §2. Please note that the result by Tellez et al. [31] was generated with μTC but using a special term-weighting scheme based on entropy instead of TFIDF (or TF). The details of the entropy-based term-weighting scheme are beyond the scope of this contribution; the interested reader is referred to [31]. The plain μTC, as described in this manuscript, achieves accuracies of 0.7880 and 0.8849 for gender and variety identification, respectively. The joint prediction of both classes achieves an accuracy of 0.7038. These scores place the plain μTC in eighth position in the official ranking, see [21].
Table 5 reports the performance on topic classification benchmarks. These experiments considered several news datasets (please refer to Table 1 for the detailed description of each benchmark). Our approach, μTC, reaches the best results on most of the datasets, with the exception of News-20 and News-4, where μTC reaches the second and third best performance.
For the sentiment analysis task, we compare on the datasets reported in [32, 33]. Moreover, we report the results obtained with the B4MSA approach [34]. B4MSA (https://github.com/INGEOTEC/b4msa) is a method for multilingual polarity classification considered as a baseline to build more complex approaches. It is important to note that, from each dataset reported in [32, 33], both approaches, B4MSA and μTC, use the subset specified in Table 6; e.g., for Arabic we used 100% of the dataset, for German we used 80%, and so on (all specified in the table).
In Table 6, it can be seen that the best results were obtained with B4MSA and μTC in all cases, and both results are very close.
Finally, Table 7 shows the results of the spam classification task. Here, it can be seen that the best results in the macro-F1 measure were obtained with our approach, μTC; nevertheless, the best results in the accuracy score were achieved by Androutsopoulos et al. [35], except on the Ling-Spam dataset, where μTC reached the best performance.
5.1 About the pre-processing state of the input text
Here, the pre-processing step is analyzed; to this end, Table 8 shows the performances obtained on the News benchmark at various stages of the normalization process, each used as input for μTC. We found that μTC achieves high performance without additional sophisticated pre-processing steps, almost all of which are language dependent. For instance, the performance using the raw text is only slightly below the performance using the stemmed collection. In practice, μTC barely needs human intervention to prepare the input text, and skipping it does not significantly reduce performance. In contrast, methods like Escalante et al. [13] and Cachopo [7] need the stemmed version of the dataset to achieve their optimal performance; for more details, see Table 5.
5.2 On the robustness of the score function
The score function leads the model selection procedure to fulfill the requirements of the task. In this process, it is necessary to determine precisely which quality measure is needed, e.g., macro-F1 or accuracy. As with any learning algorithm, it is necessary to protect the score with a validation scheme to avoid latent overfitting. On this matter, we consider two validation schemes: i) stratified $k$-folds and ii) a random binary partition of the (training) collection into a train set and a test set of prescribed sizes.
To learn how to choose the right criteria, we review both the predicted and the actual performance (macro-F1, for instance) of these two validation schemes. The predicted macro-F1 is the performance achieved by the model selection procedure using either of the two mentioned validation schemes. The actual performance is the one obtained by evaluating directly on the gold-standard collection.
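For reference, the two quality measures mentioned in this section can be computed as in the following sketch. It uses the standard definitions of accuracy and macro-averaged F1 (the unweighted mean of the per-class F1 scores); taking the classes from the union of gold and predicted labels and scoring an absent class as zero are conventions assumed for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally, macro-F1 penalizes a classifier that ignores small classes, whereas accuracy can remain high on unbalanced collections.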
Figure 1 shows the performance of μTC on small datasets. The stability of $k$-folds in terms of predicted and actual performance is supported by Figures 1(a), 1(c) and 1(e). This is also true for larger datasets like those depicted in Figures 2(a), 2(c) and 2(e). The figures show that μTC achieves almost its optimal actual performance even for small $k$, even though the predicted performance is most of the time better for larger $k$ values. On the other hand, the binary partition method is prone to overfitting, especially on small datasets and with small test sets. For instance, Figure 1(b) shows the performance for NFL; please note that some partition sizes yield very competitive performances, i.e., higher than 0.9 for both macro-F1 and accuracy. These performances are considerably higher than those achieved by the alternatives (see Table 2); however, other partition sizes yield low actual performances, in contrast with a perfect predicted performance. A similar case happens for the Business dataset, Figure 1(d); but in this case, the actual performance is relatively stable. The binary partition is less prone to overfitting on larger datasets, as Figures 2(b) and 2(d) illustrate. Nonetheless, the case of R52, Figure 2(f), shows that the overfitting issue is still latent; however, it barely affects the actual performance, since the score function is applied to a large enough test set.
As a rule of thumb, it is safe to use $k$-fold cross-validation to compute score in the model selection stage. We encourage the use of small $k$ values (e.g., 2, 3 or 5), since the actual performance is relatively stable and the computational cost is kept low. Please notice that the $k$-folds procedure introduces a factor of $k$ to the computational cost of score, and algorithms that solve the underlying combinatorial optimization problem need to evaluate a considerable number of configurations to achieve good results. In cases where the number of samples is very large, or a rapid solution is required, the binary partition method is also a good choice, especially with large test sets. The latter setup yields robust score functions at the cost of reducing the train set in the model selection stage. The reduction of the training set is not a major problem for the actual performance, as illustrated by the experiments on binary partition performances, see Figures 1 and 2. Please remember that, at this stage, we are just selecting a proper configuration; in a subsequent step, the final model is computed using this configuration and the entire training dataset.
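The two validation schemes can be sketched with the Python standard library as follows. This is an illustrative sketch under our own assumptions, not the procedure implemented in microTC: the function names, the round-robin assignment of each class to folds, the `test_fraction` parameter, and the fixed random seed are all hypothetical. The `evaluate` callback stands for training a configuration on the train indices and measuring it on the test indices.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each class is spread over the k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)  # round-robin keeps class proportions
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def binary_partition(n, test_fraction, seed=0):
    """One random split of n items into (train_idx, test_idx)."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = max(1, round(n * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def cv_score(labels, k, evaluate):
    """Mean of evaluate(train_idx, test_idx) over the folds.

    Each configuration costs k evaluations, hence the factor of k in the
    cost of the score function."""
    results = [evaluate(train, test) for train, test in stratified_kfold(labels, k)]
    return sum(results) / len(results)
```

With the binary partition, each configuration is evaluated only once, which is why it is cheaper than $k$-folds but depends more heavily on the size of the held-out set.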
6 Conclusions
In this work, a minimalistic and global approach to text classification is proposed. Our approach was evaluated on a broad range of classification tasks such as topic classification, sentiment analysis, spam detection and user profiling; for this purpose, a total of 30 datasets related to these tasks were employed. To evaluate the performance of our approach, the results obtained in each task were compared to the state-of-the-art methods for that task. Additionally, we analyzed the effect of the pre-processing stage. In this experiment, we observed that our approach is competitive with the alternative methods even when using the raw text as input, without a penalty in performance; therefore, it is possible to use μTC to create text classifiers with little knowledge of natural language processing and machine learning techniques. We also studied some simple strategies to avoid the overfitting problem; we considered a $k$-fold cross-validation scheme and a binary partition to perform the model selection. Based on our experimental observations, μTC can both properly fit the dataset and speed up the construction step by using small $k$ values in cross-validation schemes and small training sets when we use binary random partitions. We also found that $k$-folds can be the preferred validation scheme on small to medium-sized datasets, whereas very large datasets can use the binary partition scheme without a significant reduction in performance while keeping the cost of the entire process low.
References
- [1] E. S. Tellez, S. Miranda-Jiménez, M. Graff, D. Moctezuma, R. R. Suárez, O. S. Siordia, A simple approach to multilingual polarity classification in Twitter, Pattern Recognition Letters.
- [2] A. Khan, B. Baharudin, L. H. Lee, K. Khan, A review of machine learning algorithms for text-documents classification, Journal of Advances in Information Technology 1 (1) (2010) 4–20.
- [3] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometrics and Intelligent Laboratory Systems 2 (1) (1987) 37–52.
- [4] T. K. Landauer, P. W. Foltz, D. Laham, An introduction to latent semantic analysis, Discourse Processes 25 (2-3) (1998) 259–284.
- [5] J. J. Rocchio, Relevance feedback in information retrieval.
- [6] T. Joachims, A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, Tech. rep., DTIC Document (1996).
- [7] A. Cardoso-Cachopo, Improving methods for single-label text categorization, Ph.D. thesis, Instituto Superior Técnico, Portugal (2007).
- [8] I. Androutsopoulos, G. Paliouras, E. Michelakis, Learning to filter unsolicited commercial e-mail, “DEMOKRITOS”, National Center for Scientific Research, 2004.
