HSFN: Hierarchical Selection for Fake News Detection building Heterogeneous Ensemble
Sara B. Coutinho, Rafael M.O. Cruz, Francimaria R. S. Nascimento, George D. C. Cavalcanti

TL;DR
This paper introduces HSFN, a hierarchical classifier selection method that enhances fake news detection by maximizing diversity and performance in ensemble models, outperforming existing methods on multiple datasets.
Contribution
The paper presents a novel hierarchical selection approach for classifiers that improves ensemble diversity and accuracy in fake news detection tasks.
Findings
Achieves highest accuracy on two of six datasets.
Effectively balances diversity and performance in classifier selection.
Outperforms the Elbow heuristic on all six datasets and the literature approaches on Covid and Kaggle.

Table I: Experimental configuration.

| Experimental element | Items |
|---|---|
| Feature representation | CV, TF-IDF, W2V, GLOVE, FAST |
| Classification algorithm | SVM, LR, RF, NB, MLP, ET, CNN, KNN |
| Meta-classifier | LR, RF, NB |
| Dataset | Liar, Senti, Covid, Fa-Kes, Ott, Kaggle |
| Number of labels | 6, 2 |
| Contexts | Politics, Health, Syrian War, Tourism, Diverse topics |


Table II: Best-performing approach per dataset and metric (number of classifiers in parentheses).

| Dataset | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Liar | MLP-FAST | SVM-W2V | C-RF (40) | NB-CV |
| Senti | D-NB (1) | D-NB (1) | D-LR (5) | D-LR (5) |
| Covid | KNN-CV | KNN-CV | CNN-GLOVE | C-RF (40) |
| Fa-Kes | RF-TFIDF | RF-TFIDF | SVM-FAST, LR-FAST, LR-W2V | B-CV-LR (8) |
| Ott | NB-TFIDF | D-RF (20) | D-RF (10) | D-RF (20) |
| Kaggle | C-RF (40) | C-RF (40) | EXTRA-TFIDF | C-RF (40) |


Table III: Accuracy of literature approaches, the random Baseline, the Elbow heuristic, and HSFN.

| Dataset | Literature approach | Literature result | Baseline | Elbow | HSFN |
|---|---|---|---|---|---|
| Liar | Hybrid CNN [7] | 0.274 | 0.167 | 0.232 | 0.241 |
| Senti | BERT-Base + CNN [13] | 0.700 | 0.500 | 0.664 | 0.677 |
| Covid | CNN-LSTM [3] | 0.930 | 0.500 | 0.898 | 0.931 |
| Fa-Kes | CNN-RNN [8] | 0.600 | 0.500 | 0.499 | 0.526 |
| Ott | LIWC + Bigrams + SVM [26] | 0.898 | 0.500 | 0.800 | 0.865 |
| Kaggle | RF [10] | 0.950 | 0.500 | 0.931 | 0.987 |

Values in parentheses reported for the cross-validated datasets: Fa-Kes (0.037), Ott (0.033), Kaggle (0.026).

HSFN: Hierarchical Selection for Fake News Detection building Heterogeneous Ensemble
Sara B. Coutinho¹, Rafael M. O. Cruz², Francimaria R. S. Nascimento¹ and George D. C. Cavalcanti¹
This work was supported by the Brazilian CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, in Portuguese).
¹ Sara B. Coutinho, Francimaria R. S. Nascimento, and George D. C. Cavalcanti are with Centro de Informática (CIn), Universidade Federal de Pernambuco (UFPE), Av. Jornalista Anibal Fernandes s/n, Recife, Brazil. Email: {sbc2, frsn2, gdcc}@cin.ufpe.br
² Rafael M. O. Cruz is with École de Technologie Supérieure (ÉTS), Université du Québec, 1100 Notre-Dame St W, Montreal, Quebec, Canada. Email: [email protected]
Abstract
Psychological biases, such as confirmation bias, make individuals particularly vulnerable to believing and spreading fake news on social media, leading to significant consequences in domains such as public health and politics. Machine learning-based fact-checking systems have been widely studied to mitigate this problem. Among them, ensemble methods are particularly effective in combining multiple classifiers to improve robustness. However, their performance heavily depends on the diversity of the constituent classifiers, and selecting genuinely diverse models remains a key challenge, especially when models tend to learn redundant patterns. In this work, we propose a novel automatic classifier selection approach that prioritizes diversity while also accounting for performance. The method first computes pairwise diversity between classifiers and applies hierarchical clustering to organize them into groups at different levels of granularity. The HierarchySelect heuristic then explores these hierarchical levels to select one pool of classifiers per level, each representing a distinct intra-pool diversity. From these pools, the most diverse one is identified and selected for ensemble construction. The selection process incorporates an evaluation metric reflecting each classifier’s performance to ensure the ensemble also generalizes well. We conduct experiments with 40 heterogeneous classifiers across six datasets from different application domains and with varying numbers of classes. Our method is compared against the Elbow heuristic and state-of-the-art baselines. Results show that our approach achieves the highest accuracy on two of six datasets. The implementation details are available on the project’s repository: https://github.com/SaraBCoutinho/HSFN.
I INTRODUCTION
The widespread adoption of the Internet, the emergence of new communication channels, and the advancement of artificial intelligence technologies have significantly improved access to information, offering various benefits to society. However, these same developments have facilitated the rapid dissemination of Fake News (FN), raising serious concerns as AI-driven algorithms amplify sensational content to maximize user engagement [25]. Several societal domains, including politics and public health, have been negatively affected. For instance, according to [1], the 2016 U.S. presidential election may have been influenced by the spread of FN, as individuals tend to perceive information that aligns with their beliefs as accurate—a phenomenon rooted in political bias. Similarly, during the COVID-19 pandemic, FN related to the virus spread widely, contributing to public misinformation during a critical global crisis [3]. These impacts demonstrate the importance of developing effective strategies to detect and prevent FN dissemination.
Numerous studies have proposed machine learning-based fact-checking systems to address FN detection [24]. Among these, ensemble learning has emerged as a promising approach by combining the predictive power of multiple classifiers. As shown in [21], integrating diverse feature representations and learning algorithms can enhance the performance of such systems. Nevertheless, ensuring classifier diversity remains a challenge, as redundant models can limit the ensemble’s effectiveness. In fact, recent work demonstrates that combining all models leads to inferior results [23]. Redundancy among classifiers may result in correlated errors, reducing the ensemble’s ability to generalize and detect novel patterns in the data. To address this, [4] highlights the importance of analyzing the information dissimilarity between classifiers. In this context, [6] introduces a Multiple Classifier System (MCS) that relies on manual selection of classifiers from visually identified groups in a two-dimensional space called the Classifier Projection Space [20]. While effective, this manual approach is prone to subjective bias and may fail to identify the most diverse combination of classifiers, reinforcing the need for automated selection strategies.
To address the challenges associated with constructing effective Multiple Classifier Systems (MCS) for FN detection, this work introduces the Hierarchical Selection for Fake News Detection (HSFN) framework. HSFN automatically builds ensembles by combining heterogeneous classifiers generated from diverse feature representation techniques and classification algorithms. During the training phase, a pool of classifiers is created, and their prediction behaviors on a validation set are analyzed using a diversity measure to compute a dissimilarity matrix. A hierarchical clustering process is then applied to group classifiers based on their diversity, forming a dendrogram that organizes classifiers according to their similarities. To select an effective and diverse subset, the proposed HSFN systematically explores different levels of the dendrogram, partitioning it into clusters and selecting representative classifiers from each group. By modeling classifier diversity hierarchically and automating the selection process, HSFN constructs robust ensembles while avoiding manual selection strategies from previous works [6, 35].
Hence, the main contributions of this work are: (1) a comprehensive evaluation of MCS for FN detection; (2) an exploration of classifier dissimilarity at different levels of granularity to enhance ensemble diversity and performance; (3) the development of an automated selection strategy to identify the most diverse classifier pairs, considering both feature representations and algorithms.
II RELATED WORKS
II-A Feature representation for FN detection
The development of various feature representation techniques over time has significantly influenced the methods used in fake news (FN) detection. In [30], the authors employed an attention-based representation using BERT [19] to capture semantic relationships by considering all elements of a sentence. Similarly, in [34], the researchers investigated both sparse and dense representations, including TF-IDF and W2V [17]. These were used to model temporal features through Bi-LSTM and spatial features via Named-Entity Recognition (NER) and GloVe embeddings [18]. The resulting feature vectors were then combined, leading to improved FN detection performance. However, each feature representation technique presents distinct strengths and limitations that must be considered when selecting an appropriate method.
It is important to account for the specific characteristics of each technique. According to [32], sparse vector representations such as TF-IDF, which rely on word frequency, are less suitable for large corpora. In contrast, dense representations like W2V are more appropriate for longer documents as they better capture contextual information. Additionally, GloVe provides dense representations based on global co-occurrence statistics, offering an effective alternative for capturing semantic meaning. Attention-based representations, such as those generated by BERT, have demonstrated strong performance in capturing contextual word meanings and exhibit greater flexibility [32, 33]. However, they require substantial computational resources, larger amounts of data, and transfer learning with labeled data for specific downstream tasks.
II-B Learning algorithm for FN detection
Several works have used multiple algorithms to learn patterns for FN detection. In [9], the authors explored two hybrid system approaches. The first adopted a classifier system composed of a classical machine learning algorithm (SVM) and deep learning algorithms (CNN and DNN). The second explored only deep learning algorithms (CNN and Bi-LSTM). Alternatively, [10] used an ensemble system that combined the algorithms in parallel, rather than chaining them one after the other in a sequence, to leverage collective learning simultaneously. In [31], the researchers used a stacking model with SVM, LR, and RF as base classifiers and XGBoost as the meta-classifier. [11] highlights the effectiveness of ensemble systems with heterogeneous algorithms, demonstrating that stacking outperforms hybrid systems that employ sequential algorithmic combinations.
II-C MCS for FN detection
In [6], researchers investigated diversity in ensemble systems using an MCS composed of various feature representation techniques, such as sparse and dense vector representations, and distinct learning algorithms, such as statistical, symbolic, ensemble, and deep learning methods. They showed the advantage of adopting a heterogeneous and diverse pool of classifiers in an MCS. In addition, [23] integrates distinct feature representation techniques to achieve a more consistent representation and demonstrates that a selected set of views, combined into the final representation that feeds the algorithm, is more accurate than a single representation technique.
From both works, it is relevant to highlight that exploiting techniques from these two groups (feature representations and learning algorithms) enriches the ensemble analysis and is a successful way to obtain diversity. Nevertheless, [6] selects the pool manually, which requires visual inspection and human effort. Moreover, beyond the need for automating pool selection, [23] shows that a subset of feature representations can be more informative than the combination of all of them; identifying such a subset is still an open problem. Our work addresses these limitations through an automated hierarchical selection approach that systematically identifies diverse classifier subsets while preserving the performance benefits established in prior work.
III PROPOSED METHOD: HSFN
The Hierarchical Selection for Fake News detection (HSFN) method constructs a diverse Multiple Classifier System (MCS) via hierarchical clustering. As illustrated in Figure 1, HSFN operates in two phases: (1) training and (2) testing. In the training phase, the training data is used to generate a pool of classifiers, which is then applied in the testing phase to predict labels for unseen test data. HSFN’s novelty lies in its hierarchical clustering-based selection mechanism, which systematically maximizes classifier diversity. The proposed heuristic, HierarchySelect, groups classifiers by dissimilarity and selects an optimal subset for MCS construction. Below, we detail each phase.
III-A Training phase
In phase (1), we preprocess the text sentences in the training data and encode class labels numerically. Consider a set of feature extraction methods (e.g., TF-IDF, BERT embeddings) that convert text into numerical representations, and a set of classification algorithms (e.g., SVM, Random Forest). For each (feature extraction method, algorithm) pair, we train a classifier that learns patterns from the features and labels. This yields a pool with one classifier per pair. To mitigate overfitting (as in [6]), each classifier is evaluated on validation data. We then compute a dissimilarity matrix, where each entry captures the pairwise dissimilarity between two classifiers’ predictions. Finally, we select a diverse subset from the pool and integrate it into a final MCS for test-time prediction.
The automatic selection mechanism, the subset selection step, comprises two sub-steps: (1.1) hierarchical grouping and (1.2) application of the HierarchySelect heuristic. In step (1.1), we apply hierarchical clustering to group classifiers based on their dissimilarity, producing a dendrogram. The clustering uses a linkage function (e.g., Complete, Single, Average, or Centroid [22]) to merge clusters iteratively. In step (1.2), HierarchySelect dynamically cuts the dendrogram at varying hierarchy levels, extracting subsets of maximally diverse classifiers. Unlike [6], which relies on manual selection of classifiers through visualization, our method automates this process by systematically evaluating diversity at each hierarchical level and selecting optimal ensembles without human intervention.
The algorithm analyzes dendrogram levels to select classifier subsets. Figure 2 demonstrates this process with an auxiliary line traversing the hierarchy vertically, revealing three classifier groups: (1) BERT-SVM, (2) TFIDF-SVM, and (3) GLOVE-LR paired with CV-NB, clustered by prediction dissimilarity. From each group, we select the top-performing classifier based on evaluation metrics (e.g., choosing GLOVE-LR over CV-NB). The resulting ensemble combines complementary capabilities: BERT-SVM for contextual analysis, TFIDF-SVM for lexical patterns, and GLOVE-LR for short-text bias detection. This selection method ensures diversity while optimizing performance, with semantic differences between classifiers informing critical design choices like linkage functions, particularly valuable for distinguishing nuanced cases like satire versus malicious intent.
Algorithm 1 formally describes HierarchySelect. The algorithm takes as input a dissimilarity matrix and performance metrics for all classifiers (representation-algorithm pairs) in the pool. Unlike [6], which requires manual threshold selection, our algorithm automatically evaluates all possible hierarchy levels (number of clusters from 1 to the pool size). At each level, it: (1) partitions classifiers into clusters using hierarchical clustering, (2) selects the highest-performing classifier from each cluster based on metrics, and (3) adds these classifiers to the MCS candidate set. For step (1), a function called f_cluster is employed to identify flat clusters at a given level and to return the cluster containing each classifier. The function receives the number of clusters at the specified level and applies a criterion, such as distance, to group proximate elements within the same cluster. This exhaustive search over levels covers diversity-accuracy trade-offs across all possible ensemble sizes, eliminating the subjectivity inherent in visual inspection approaches.
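The level-by-level selection described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the function names (`average_linkage_levels`, `hierarchy_select`, `mean_pairwise_dissimilarity`) are hypothetical, average linkage is picked arbitrarily among the linkage functions mentioned, a small hand-written agglomerative loop stands in for the f_cluster routine, and the intra-pool diversity used to rank pools is assumed to be the mean pairwise dissimilarity. The toy matrix and scores are made up.

```python
from itertools import combinations


def average_linkage_levels(dissim):
    """Return the flat partition at every dendrogram level (n clusters down to 1)."""
    clusters = [[i] for i in range(len(dissim))]
    levels = {len(clusters): [list(c) for c in clusters]}

    def avg_dist(a, b):
        return sum(dissim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average dissimilarity.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
        levels[len(clusters)] = [list(c) for c in clusters]
    return levels


def hierarchy_select(dissim, scores):
    """For each level k, keep the best-scoring classifier of every cluster."""
    pools = {}
    for k, partition in average_linkage_levels(dissim).items():
        pools[k] = sorted(max(cluster, key=lambda i: scores[i]) for cluster in partition)
    return pools


def mean_pairwise_dissimilarity(pool, dissim):
    """Assumed intra-pool diversity: average dissimilarity over all pairs in the pool."""
    pairs = list(combinations(pool, 2))
    if not pairs:
        return 0.0
    return sum(dissim[i][j] for i, j in pairs) / len(pairs)


# Hypothetical dissimilarity matrix and validation scores for four classifiers.
dissim = [
    [0.00, 0.60, 0.80, 0.90],
    [0.60, 0.00, 0.70, 0.75],
    [0.80, 0.70, 0.00, 0.20],
    [0.90, 0.75, 0.20, 0.00],
]
scores = [0.70, 0.66, 0.64, 0.60]

pools = hierarchy_select(dissim, scores)
print(pools)  # {4: [0, 1, 2, 3], 3: [0, 1, 2], 2: [0, 2], 1: [0]}

most_diverse = max(pools.values(), key=lambda p: mean_pairwise_dissimilarity(p, dissim))
print(most_diverse)  # [0, 2]
```

With n classifiers the loop produces exactly n candidate pools, one per level, which is why the search stays linear in the pool size.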
III-B Testing phase
In phase (2), each classifier subset selected by HierarchySelect is integrated into a final ensemble using stacking. The validation predictions from all classifiers in a subset are used to train a meta-classifier, which then generates the final predictions for the test data. This approach enables adaptive weighting of classifier contributions based on their validation performance, resulting in improved generalization compared to individual classifiers or static combination rules [16].
Therefore, based on these two phases, HSFN provides a final classifier that reflects the diversity obtained from the pool selected at the hierarchical level. In this context, we examine whether HSFN’s performance is comparable to that of other heuristic methods and of monolithic classifiers. These points are discussed in the analysis of the experimental results.
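To make the stacking step concrete, the sketch below trains a tiny naive Bayes meta-classifier on the validation predictions of the base classifiers. It is only an illustration: the experiments rely on DESlib for stacking, whereas this stand-in is written from scratch, and the names `fit_meta_nb` and `predict_meta_nb`, the Laplace smoothing constant, and the toy predictions are all assumptions.

```python
import math
from collections import Counter


def fit_meta_nb(val_preds, val_labels, alpha=1.0):
    """Fit a categorical naive Bayes on base-classifier predictions.

    val_preds[i][j] is the label predicted by base classifier j for
    validation instance i.
    """
    classes = sorted(set(val_labels))
    n_values = len(set(val_labels) | {p for row in val_preds for p in row})
    class_count = Counter(val_labels)
    # vote_count[c][j][v]: how often classifier j predicted v when the truth was c.
    vote_count = {c: [Counter() for _ in val_preds[0]] for c in classes}
    for row, label in zip(val_preds, val_labels):
        for j, vote in enumerate(row):
            vote_count[label][j][vote] += 1
    return {"classes": classes, "n_values": n_values, "class_count": class_count,
            "vote_count": vote_count, "alpha": alpha}


def predict_meta_nb(model, row):
    """Return the class with the highest smoothed log-posterior for one instance."""
    total = sum(model["class_count"].values())

    def log_posterior(c):
        n_c = model["class_count"][c]
        score = math.log(n_c / total)
        for j, vote in enumerate(row):
            num = model["vote_count"][c][j][vote] + model["alpha"]
            den = n_c + model["alpha"] * model["n_values"]
            score += math.log(num / den)
        return score

    return max(model["classes"], key=log_posterior)


# Toy validation set: classifier 0 is always right, classifier 1 is right 4 times out of 6.
val_preds = [[0, 0], [0, 1], [1, 1], [1, 0], [0, 0], [1, 1]]
val_labels = [0, 0, 1, 1, 0, 1]
model = fit_meta_nb(val_preds, val_labels)

# When the two base classifiers disagree, the meta-classifier follows the reliable one.
print(predict_meta_nb(model, [0, 1]))  # 0
print(predict_meta_nb(model, [1, 0]))  # 1
```

The meta-classifier learns from validation data how much each base classifier's vote should count, which is the adaptive weighting that a static rule such as majority voting cannot provide.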
IV EXPERIMENTAL SETUP
Table I summarizes our experimental configuration. We selected five feature representations spanning different approaches: CountVectorizer (CV) and TF-IDF for traditional bag-of-words modeling, and Word2Vec (W2V), GloVe, and FastText for distributed word embeddings. These techniques were chosen based on their established effectiveness in fake news detection [8, 9, 12] and their ability to provide diverse feature spaces for classifier analysis. While transformer-based models like BERT offer superior contextual understanding, we excluded them from this study due to their substantial computational requirements for ensemble systems. The selected representations provide sufficient diversity for analyzing classifier behavior while maintaining practical computational efficiency.
For classification algorithms, we implement eight approaches spanning all major paradigms: Support Vector Machines (SVM), Logistic Regression (LR), Random Forest (RF), Extra Trees (ET), Naïve Bayes (NB), Multi-Layer Perceptron (MLP), Convolutional Neural Networks (CNN), and K-Nearest Neighbors (KNN), following [6, 9, 10]. This creates 40 classifier variants through the full combination of the five representations and eight algorithms. Hyperparameters for traditional algorithms use scikit-learn defaults, while MLP and CNN architectures follow [6], with output layers adapting to each dataset’s class count (2-6 neurons). We evaluate performance using accuracy, precision, recall, and F1-score, consistent with [9, 10, 13, 26].
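As a quick check of the pool size, the full combination can be enumerated as follows. The abbreviations come from Table I; the `ALGORITHM-REPRESENTATION` naming pattern mirrors labels such as MLP-FAST in Table II, and the helper name `enumerate_pool` is hypothetical.

```python
from itertools import product

# Abbreviations follow Table I.
REPRESENTATIONS = ["CV", "TF-IDF", "W2V", "GLOVE", "FAST"]
ALGORITHMS = ["SVM", "LR", "RF", "NB", "MLP", "ET", "CNN", "KNN"]


def enumerate_pool(algorithms, representations):
    """List one classifier name per (algorithm, representation) pair."""
    return [f"{alg}-{rep}" for alg, rep in product(algorithms, representations)]


pool = enumerate_pool(ALGORITHMS, REPRESENTATIONS)
print(len(pool))  # 40
print(pool[:3])   # ['SVM-CV', 'SVM-TF-IDF', 'SVM-W2V']
```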
For classifier selection, we employ the double-fault diversity metric, which measures the proportion of instances where both classifiers in a pair make incorrect predictions. This choice is motivated by [5], which demonstrates that its inverse correlates with ensemble accuracy, making it particularly suitable for selecting complementary classifiers. The subsequent ensemble combination uses stacking implemented via DESlib [27], with three alternative meta-classifiers: Logistic Regression (LR), Random Forest (RF), and Naïve Bayes (NB). These meta-classifiers were selected based on their proven effectiveness in similar ensemble frameworks [14, 15].
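The double-fault measure follows directly from its definition. The sketch below also builds a pairwise matrix from it; mapping the measure to a dissimilarity as `1 - double_fault` with a zero diagonal is an assumption made for illustration, since the exact mapping is not stated here, and the function names and toy predictions are hypothetical.

```python
def double_fault(y_true, pred_a, pred_b):
    """Fraction of instances that both classifiers misclassify."""
    both_wrong = sum(1 for y, a, b in zip(y_true, pred_a, pred_b) if a != y and b != y)
    return both_wrong / len(y_true)


def dissimilarity_matrix(y_true, predictions):
    """Pairwise dissimilarity, taken here as 1 - double-fault (an assumption)."""
    n = len(predictions)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - double_fault(y_true, predictions[i], predictions[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


y_val = [1, 0, 1, 1, 0, 0]
preds = [
    [1, 0, 0, 1, 0, 1],  # wrong on instances 2 and 5
    [1, 0, 0, 1, 1, 0],  # wrong on instances 2 and 4
    [0, 0, 1, 1, 0, 0],  # wrong on instance 0
]

# Classifiers 0 and 1 share a single error (instance 2), so their double-fault is 1/6.
print(double_fault(y_val, preds[0], preds[1]))
# Classifiers 0 and 2 never fail together, so their double-fault is 0.
print(double_fault(y_val, preds[0], preds[2]))  # 0.0
```

A low double-fault value means the two classifiers rarely fail on the same instance, so one can compensate for the other inside an ensemble.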
We evaluate on six datasets used for fake news detection, chosen to cover varied contexts for a more robust evaluation. The Liar dataset [7] has six labels for political news. The Senti dataset [13] uses the same political context as Liar but with two labels. The Covid dataset [2] is about health. The Fa-kes dataset covers the Syrian war. The Ott dataset [26, 29] is about tourism. The Kaggle dataset [10] includes diverse topics. As in [36], we preprocess the text by removing punctuation and contractions, fixing spelling, deleting URLs and IP addresses, making words lowercase, stemming, removing stopwords, and deleting words that appear only once [8]. For the Fa-kes, Ott, and Kaggle datasets, we use 10-fold cross-validation, similar to [10].
V RESULTS
V-A Comparison of Selection Strategies
We evaluate HSFN against monolithic classifiers and three alternative selection heuristics from [6]: Group A (classifiers from a single algorithm), Group B (classifiers using a single representation technique), Group C (all monolithic classifiers combined), and our proposed Group D (HSFN-selected classifiers). Each group is identified by its letter designation, followed by the specific technique (for Groups A and B) and the meta-classifier employed. Table II compares their performance across all datasets and metrics, where results for Fa-kes, Ott, and Kaggle represent 10-fold averages. The analysis prioritizes solutions that achieve comparable accuracy with fewer classifiers, with the number in parentheses indicating the classifier count for each approach.
V-B Comparison with the state-of-the-art
We evaluate HSFN against three comparison approaches: (1) existing methods from the literature, (2) an Elbow selection method, and (3) a Baseline classifier. Following [28], the Elbow method determines the optimal number of classifier groups by analyzing the balance between inter-group distance and intra-group cohesion. The Baseline represents random classification performance, calculated as 100% divided by the number of classes, yielding 50% accuracy for binary classification (2-class datasets) and approximately 16.7% for six-class problems. Table III presents the comparative results across all approaches.
V-C Discussion
The performance comparison in Table II reveals that monolithic classifiers achieved the highest results in 45.8% of cases, followed by the proposed method (29.2%), Group C (20.8%), and Group B (4.2%). While Group A did not outperform any other approach, the proposed method demonstrated superior performance compared to Groups A-C, confirming the effectiveness of our selection heuristic. Notably, the proposed method accomplished these results using at most 20 classifiers (50% of all available monolithic classifiers), highlighting the value of selective diversity over exhaustive combinations. However, the results also show that certain monolithic classifiers can outperform all ensemble groups, including our proposed method, in specific contexts.
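These percentages can be reproduced from the 24 cells of Table II (6 datasets by 4 metrics). The win counts below are tallied by hand from the table as reproduced in this document, with the three-way tie in the Fa-Kes recall cell counted once as a monolithic win; the tally is therefore an assumption about how the shares were computed.

```python
# Number of (dataset, metric) cells of Table II won by each group (hand tally).
wins = {"Monolithic": 11, "Group D (HSFN)": 7, "Group C": 5, "Group B": 1, "Group A": 0}

total = sum(wins.values())  # 6 datasets x 4 metrics = 24 cells
shares = {group: round(100 * count / total, 1) for group, count in wins.items()}
print(shares)
# {'Monolithic': 45.8, 'Group D (HSFN)': 29.2, 'Group C': 20.8, 'Group B': 4.2, 'Group A': 0.0}
```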
The proposed method’s key advantage lies in its consistent generalization across diverse datasets while maintaining competitive performance with reduced computational requirements. This demonstrates an important trade-off in ensemble design: while individual classifiers may achieve peak performance in specific cases, systematic diversity selection yields more robust performance overall. The method’s efficiency is particularly notable, as it explores only 40 possible combinations rather than the 1,099,511,627,775 potential combinations for MCS construction.
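The size of the exhaustive search space follows from counting every non-empty subset of a 40-classifier pool, compared with one candidate pool per dendrogram level. The variable names in this short check are illustrative.

```python
n_classifiers = 40

# Every non-empty subset of the pool is a possible MCS.
all_subsets = 2 ** n_classifiers - 1
# The hierarchical search evaluates one candidate pool per dendrogram level.
levels_explored = n_classifiers

print(f"{all_subsets:,}")  # 1,099,511,627,775
print(levels_explored)     # 40
```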
Comparative results with existing literature approaches (Table III) show that the proposed method outperformed the Baseline and Elbow on every dataset and achieved superior accuracy on the Covid and Kaggle datasets. Some literature methods achieved higher performance on the other datasets, particularly [7, 13], which incorporated additional non-text features for the Liar and Senti datasets, respectively. These variations likely stem from differences in dataset characteristics, including size, content domain, and class distribution. Despite these factors, the proposed method demonstrates competitive performance across all evaluation scenarios, confirming its viability for practical fake news detection applications.
VI CONCLUSION
This work introduced HSFN, a method for selecting a diverse subset of classifiers for FN detection using hierarchical grouping and a double-fault diversity metric. By leveraging heterogeneous classifiers trained on multiple algorithms and feature representations and exploring the hierarchical levels, the approach improved classification performance while ensuring diversity. Experimental results on six datasets showed that the proposed method outperformed or matched existing selection heuristics, including baseline models and the Elbow approach.
These findings highlight the effectiveness of diversity-aware selection in MCS for FN detection, demonstrating its potential to enhance robustness across different datasets and classification tasks. Future work includes expanding the method to multilingual datasets, refining feature representations, and optimizing the integration step through multi-level stacking.
REFERENCES
- [1] Allcott, Hunt, and Matthew Gentzkow. "Social media and fake news in the 2016 election." Journal of Economic Perspectives 31.2 (2017): 211-236.
- [2] Patwa, Parth, et al. "Fighting an infodemic: Covid-19 fake news dataset." International Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation. Cham: Springer International Publishing, 2021.
- [3] Surendran, Pranav, et al. "Covid-19 fake news detector using hybrid convolutional and Bi-LSTM model." 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT). IEEE, 2021.
- [4] Kuncheva, Ludmila I. Combining pattern classifiers: methods and algorithms. Wiley, 2004.
- [5] Kuncheva, Ludmila I., and Christopher J. Whitaker. "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy." Machine Learning 51.2 (2003): 181-207.
- [6] Cruz, Rafael M. O., Woshington V. de Sousa, and George D. C. Cavalcanti. "Selecting and combining complementary feature representations and classifiers for hate speech detection." Online Social Networks and Media 28 (2022): 100194.
- [7] Wang, William Yang. "'Liar, liar pants on fire': A new benchmark dataset for fake news detection." arXiv preprint arXiv:1705.00648 (2017).
- [8] Nasir, Jamal Abdul, Osama Subhani Khan, and Iraklis Varlamis. "Fake news detection: A hybrid CNN-RNN based deep learning approach." International Journal of Information Management Data Insights 1.1 (2021): 100007.
