Unveiling the Potential of BERTopic for Multilingual Fake News Analysis -- Use Case: Covid-19
Karla Schäfer, Jeong-Eun Choi, Inna Vogel, Martin Steinebach

TL;DR
This paper evaluates BERTopic's effectiveness for multilingual fake news analysis on Covid-19 data, optimizing its parameters and comparing results across languages to identify thematic similarities and differences.
Contribution
It provides a practical analysis of BERTopic's application in multilingual fake news detection, including hyperparameter tuning and real-world dataset evaluation.
Findings
BERTopic effectively identified thematic similarities between US and German Covid-19 fake news.
Distinguishing fake news topics in Indian data proved more challenging.
Optimal hyperparameters vary across languages and datasets.
Abstract
Topic modeling is frequently used to analyse large text corpora such as news articles or social media data. BERTopic, which consists of sentence embedding, dimension reduction, clustering, and topic extraction, is the newest and currently the state-of-the-art (SOTA) topic modeling method. However, current topic modeling methods have room for improvement because, as unsupervised methods, they require careful tuning and selection of hyperparameters, e.g., for dimension reduction and clustering. This paper analyses the technical application of BERTopic in practice. To this end, it compares and selects different methods and hyperparameters for each stage of BERTopic using density-based clustering validation and six different topic coherence measures. It also analyses the results of topic modeling on real-world data as a use case. For this purpose, the German fake news…
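As a rough illustration of the final "topic extraction" stage named in the abstract, the sketch below computes class-based TF-IDF (c-TF-IDF) weights, the weighting BERTopic uses to describe clusters, in plain Python. It is a minimal sketch under stated assumptions, not the paper's setup: the clusters are assumed to be given already (in BERTopic they would come from the embedding, dimension reduction, and clustering stages), the toy tokens are invented for illustration, term frequencies are normalised by cluster length, and the function names `c_tf_idf` and `top_terms` are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Compute c-TF-IDF weights.

    clusters: dict mapping a cluster id to a list of tokenized documents.
    Returns a dict mapping cluster id -> {term: weight}.
    """
    # Treat each cluster as one concatenated document and count its terms.
    counts = {
        cid: Counter(tok for doc in docs for tok in doc)
        for cid, docs in clusters.items()
    }
    # A: average number of words per cluster.
    avg_words = sum(sum(c.values()) for c in counts.values()) / len(counts)
    # f_t: frequency of each term across all clusters.
    total = Counter()
    for c in counts.values():
        total.update(c)

    weights = {}
    for cid, c in counts.items():
        n_words = sum(c.values())
        # Weight = normalised term frequency * log(1 + A / f_t).
        # Normalising by cluster length is an assumption of this sketch.
        weights[cid] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in c.items()
        }
    return weights


def top_terms(weights, k=3):
    """Return the k highest-weighted terms per cluster (ties broken alphabetically)."""
    return {
        cid: [t for t, _ in sorted(w.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for cid, w in weights.items()
    }


# Invented toy clusters (hypothetical data, not from the paper's datasets).
clusters = {
    0: [["covid", "vaccine", "microchip"], ["covid", "vaccine", "side", "effects"]],
    1: [["covid", "lockdown", "protest"], ["covid", "lockdown", "economy"]],
}
weights = c_tf_idf(clusters)
top = top_terms(weights, k=1)
print(top)
```

In this toy input, "covid" occurs in both clusters and is therefore down-weighted relative to cluster-specific terms such as "vaccine" or "lockdown", which is the behaviour that makes the extracted topic words distinctive.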
Taxonomy
Topics: Misinformation and Its Impacts · Hate Speech and Cyberbullying Detection
