Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
Deepak Uniyal, Md Abul Bashar, Richi Nayak

TL;DR
This paper compares four cross-lingual classification approaches for improving topic discovery in multilingual social media data, using a decade of hydrogen energy discussions in four languages.
Contribution
It systematically evaluates four cross-lingual approaches for filtering relevant social media content and extracting topics, providing insights into their trade-offs and effectiveness.
Findings
Translation-based models improve relevance filtering in some languages.
Multilingual transformers offer direct cross-lingual classification without translation.
Hybrid strategies balance translation and multilingual approaches for optimal results.
Abstract
Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span diverse languages. This study investigates how different approaches to cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013–2022) for topic discovery. Because the data were collected online by keyword matching, a significant share of the content is irrelevant. We explore four approaches to filter relevant content: (1) translating English annotated data into the target languages to build a language-specific model for each target language, (2) translating unlabelled data from all languages into English for creating a single model based on English…
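The first two approaches named in the abstract differ mainly in which side gets translated: the labelled training data or the incoming text. The following is a minimal sketch of that difference, not the authors' implementation. The word-level lexicon, the `translate` helper, the token-overlap classifier, and the example sentences are all hypothetical stand-ins for a real machine translation system, a real model, and real annotated tweets.

```python
from collections import Counter

# Toy word-level lexicon standing in for a machine translation system (assumption).
EN_TO_JA = {
    "hydrogen": "水素",
    "fuel": "燃料",
    "energy": "エネルギー",
    "water": "水",
    "drink": "飲む",
}
JA_TO_EN = {ja: en for en, ja in EN_TO_JA.items()}


def translate(text, lexicon):
    """Hypothetical translator: word-by-word lookup, unknown tokens kept as-is."""
    return " ".join(lexicon.get(tok, tok) for tok in text.split())


def train(examples):
    """Count how often each token appears under each label."""
    counts = {"relevant": Counter(), "irrelevant": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts


def classify(model, text):
    """Pick the label whose training tokens overlap most with the text."""
    scores = {
        label: sum(counter[tok] for tok in text.split())
        for label, counter in model.items()
    }
    return max(scores, key=scores.get)


# Invented English annotations, for illustration only.
english_labelled = [
    ("hydrogen fuel energy", "relevant"),
    ("drink hydrogen water", "irrelevant"),
]

# Approach (1): translate the labelled English data, then train a
# language-specific model for the target language.
ja_model = train([(translate(t, EN_TO_JA), y) for t, y in english_labelled])

# Approach (2): train a single English model and translate incoming
# text into English before classifying it.
en_model = train(english_labelled)

tweet_ja = "水素 燃料 エネルギー"  # pre-tokenised toy input
label_1 = classify(ja_model, tweet_ja)
label_2 = classify(en_model, translate(tweet_ja, JA_TO_EN))
```

In this sketch, approach (1) pays the translation cost once, at training time, but needs one model per language. Approach (2) keeps a single model but must translate every incoming text before it can be classified.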
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Computational and Text Analysis Methods · Text and Document Classification Technologies
