Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification
Mozhi Zhang, Yoshinari Fujinuma, Jordan Boyd-Graber

TL;DR
This paper introduces CACO, a cross-lingual document classification framework that exploits character-level subword similarities to transfer knowledge from a related language and improve text classification in a low-resource language.
Contribution
The paper proposes a framework that jointly trains a character-based embedder and a word-based classifier, exploiting subword similarities between related languages to improve low-resource text classification.
Findings
Character-level transfer is more data-efficient than word-level transfer.
Joint training of embedder and classifier improves accuracy.
Multi-task objectives further enhance model performance.
Abstract
Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (CACO) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the…
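The pipeline described in the abstract (a character-based embedder feeding a word-based classifier, with one character representation shared by both languages) can be illustrated with a minimal forward-pass sketch. Everything here is an assumption made for illustration, not the paper's architecture: the hashed character trigram table stands in for the embedder, the linear softmax layer stands in for the classifier, the names `CharEmbedder` and `LinearClassifier` are hypothetical, and the Spanish and Italian example words are ours. Weights are random and untrained; in joint training, the classification loss on source-language documents would update both the n-gram table and the classifier weights.

```python
import math
import random

DIM = 16          # size of each word vector (arbitrary for this sketch)
N_BUCKETS = 512   # rows in the shared character n-gram table (arbitrary)


def char_ngrams(word, n=3):
    """Split a word into character n-grams, with < and > marking its boundaries."""
    padded = f"<{word.lower()}>"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


def bucket(ngram):
    """Deterministic hash of an n-gram into a table row (built-in hash() is salted)."""
    h = 0
    for ch in ngram:
        h = (h * 31 + ord(ch)) % N_BUCKETS
    return h


class CharEmbedder:
    """Hypothetical embedder: a word vector is the mean of its n-gram vectors.

    The same table serves source and target languages, so words that share
    n-grams share parameters regardless of which language they come from.
    """

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.table = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_BUCKETS)]

    def embed(self, word):
        grams = char_ngrams(word)
        vec = [0.0] * DIM
        for g in grams:
            row = self.table[bucket(g)]
            for k in range(DIM):
                vec[k] += row[k]
        return [v / len(grams) for v in vec]


def embed_document(embedder, words):
    """Average the word vectors of a non-empty list of words."""
    vecs = [embedder.embed(w) for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


class LinearClassifier:
    """Hypothetical word-based classifier: softmax over a linear layer."""

    def __init__(self, n_classes, seed=1):
        rng = random.Random(seed)
        self.weights = [[rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(n_classes)]

    def predict_proba(self, doc_vec):
        logits = [sum(w * x for w, x in zip(row, doc_vec)) for row in self.weights]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


embedder = CharEmbedder()
classifier = LinearClassifier(n_classes=2)

# Trigrams shared by a Spanish word and its Italian cognate: these are the
# parameters through which source-language training could reach the target word.
shared = sorted(set(char_ngrams("nacional")) & set(char_ngrams("nazionale")))

# Class probabilities for a target-language document (untrained, so uninformative).
probs = classifier.predict_proba(embed_document(embedder, ["governo", "nazionale"]))
```

Because the embedder sees only written forms, a target-language word that never appeared in training still receives a vector built from n-grams that source-language words may have trained. How much this helps depends on how closely the two languages' spellings overlap.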
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
