Cross-lingual Dataless Classification for Languages with Small Wikipedia Presence
Yangqiu Song, Stephen Mayhew, Dan Roth

TL;DR
This paper introduces a method for cross-lingual dataless document classification tailored for languages with small Wikipedia presence, leveraging large-Wikipedia languages as bridges to improve classification accuracy without training data.
Contribution
It proposes converting documents from a small-Wikipedia language into a large-Wikipedia language with a word-level dictionary, choosing the bridging language by language similarity, so that classification can rely on the larger Wikipedia.
Findings
Improved classification accuracy for Small-Wikipedia languages
Comparable performance to the best possible language bridges
Effective use of language similarity metrics for LWL selection
Abstract
This paper presents an approach to classify documents in any language into an English topical label space, without any text categorization training data. The approach, Cross-Lingual Dataless Document Classification (CLDDC), relies on mapping the English labels or short category descriptions into a Wikipedia-based semantic representation, and on the use of the target-language Wikipedia. Consequently, performance could suffer when the Wikipedia in the target language is small. In this paper, we focus on languages with small Wikipedias (Small-Wikipedia languages, SWLs). We use a word-level dictionary to convert documents in an SWL to a large-Wikipedia language (LWL), and then perform CLDDC based on the LWL's Wikipedia. This approach can be applied to thousands of languages, in contrast with machine translation, which is a supervision-heavy approach and can be done for about 100…
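The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy dictionary, the toy concept index, and the use of cosine similarity over bag-of-concept vectors are all assumptions made for illustration. The sketch also assumes that label vectors and document vectors already share one set of concept IDs, which glosses over how English and LWL Wikipedia concepts are aligned.

```python
import math
from collections import Counter


def translate_words(tokens, dictionary):
    """Convert SWL tokens to LWL words with a word-level dictionary.

    Tokens missing from the dictionary are dropped (an assumption of this sketch).
    """
    return [dictionary[t] for t in tokens if t in dictionary]


def concept_vector(tokens, concept_index):
    """Build a bag-of-concepts vector by summing each token's concept weights.

    concept_index is a hypothetical mapping: LWL word -> {concept_id: weight}.
    """
    vec = Counter()
    for token in tokens:
        for concept, weight in concept_index.get(token, {}).items():
            vec[concept] += weight
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(doc_tokens, dictionary, lwl_index, label_vectors):
    """Translate an SWL document word by word, then pick the closest label."""
    lwl_tokens = translate_words(doc_tokens, dictionary)
    doc_vec = concept_vector(lwl_tokens, lwl_index)
    return max(label_vectors, key=lambda label: cosine(doc_vec, label_vectors[label]))


# Toy data, entirely made up: "w1".."w3" stand in for SWL words.
toy_dictionary = {"w1": "gol", "w2": "equipo", "w3": "banco"}
toy_lwl_index = {
    "gol": {"Football": 1.0},
    "equipo": {"Football": 0.5, "Company": 0.2},
    "banco": {"Bank": 1.0, "Company": 0.4},
}
toy_label_vectors = {
    "sports": {"Football": 1.0},
    "finance": {"Bank": 1.0, "Company": 0.5},
}

sports_doc = classify(["w1", "w2", "unknown"], toy_dictionary, toy_lwl_index, toy_label_vectors)
finance_doc = classify(["w3"], toy_dictionary, toy_lwl_index, toy_label_vectors)
```

No labeled training documents appear anywhere in the sketch: the only inputs are the dictionary, the concept index, and the label vectors, which is what makes the setup "dataless" in the sense used above.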
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
