Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

Mozhi Zhang; Yoshinari Fujinuma; Jordan Boyd-Graber

arXiv:1812.09617 · cs.CL · April 29, 2020 · 1 citation

TL;DR

This paper introduces CACO, a cross-lingual document classification framework that leverages subword similarities at the character level to improve low-resource language text classification by transferring knowledge from related languages.

Contribution

The paper proposes a novel character-based embedding method that exploits subword similarities for cross-lingual transfer, enhancing low-resource text classification performance.

Findings

01 Character-level transfer is more data-efficient than word-level transfer.

02 Joint training of embedder and classifier improves accuracy.

03 Multi-task objectives further enhance model performance.

Abstract

Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (CACO) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the…
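The embedder-plus-classifier pipeline described in the abstract can be illustrated with a small forward-pass sketch. This is not the authors' implementation: it assumes a mean over character vectors in place of whatever character-based embedder the paper actually trains, uses random untrained weights, and introduces hypothetical names (`char_vector`, `embed_word`, `classify`, `DIM`, `NUM_CLASSES`) purely for illustration. The one idea it tries to show is that a single character table shared by both languages lets similarly spelled source and target words receive related vectors.

```python
import math
import random

random.seed(0)
DIM = 8          # embedding size (arbitrary, for illustration only)
NUM_CLASSES = 2  # number of document labels (arbitrary)

# One character table shared by the source and target languages (assumption:
# both languages use overlapping scripts, so spellings can be compared).
char_vectors = {}


def char_vector(ch):
    """Look up, or lazily create, the shared vector for one character."""
    if ch not in char_vectors:
        char_vectors[ch] = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    return char_vectors[ch]


def embed_word(word):
    """Embedder: derive a word vector from the written form.

    Simplification: mean of character vectors, which ignores character order.
    """
    vecs = [char_vector(ch) for ch in word]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Untrained linear classifier weights, one row per class.
weights = [[random.gauss(0.0, 0.1) for _ in range(DIM)]
           for _ in range(NUM_CLASSES)]


def classify(document):
    """Classifier: average the word vectors, then apply linear + softmax.

    Assumes the document contains at least one whitespace-separated word.
    """
    word_vecs = [embed_word(w) for w in document.split()]
    doc_vec = [sum(col) / len(word_vecs) for col in zip(*word_vecs)]
    logits = [sum(w * x for w, x in zip(row, doc_vec)) for row in weights]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative word pair with similar spellings in two related languages.
similarity = cosine(embed_word("nazionale"), embed_word("nacional"))
probs = classify("nazionale governo")
```

In a trained system the embedder and classifier parameters would be learned jointly on labeled source-language documents, as the abstract describes. The sketch only shows how data flows from characters to word vectors to a document-level prediction.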

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies