Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation
Valentin Barriere, Alexandra Balahur

TL;DR
This paper presents a method using multilingual transformers and automatic translation for data augmentation to improve sentiment analysis on non-English tweets, especially when annotated data is scarce.
Contribution
The paper pre-trains a multilingual transformer on English tweets, then uses automatic translation as data augmentation to adapt the model to sentiment analysis of non-English tweets.
Findings
Improved sentiment analysis results on French, Spanish, German, and Italian tweets.
Translation-based data augmentation is an effective way to compensate for scarce annotated data in non-English languages.
Pre-training on English tweets combined with translation-based augmentation improves transformer results on small corpora of non-English tweets.
Abstract
Tweets are a specific kind of text data compared to general text. Although sentiment analysis over tweets has become very popular in the last decade for English, it is still difficult to find large annotated corpora for non-English languages. The recent rise of transformer models in Natural Language Processing makes it possible to achieve unparalleled performance in many tasks, but these models need a substantial quantity of text to adapt to the tweet domain. We propose the use of a multilingual transformer model, which we pre-train over English tweets, and apply data augmentation using automatic translation to adapt the model to non-English languages. Our experiments in French, Spanish, German and Italian suggest that the proposed technique is an efficient way to improve the results of transformers over small corpora of tweets in a non-English language.
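The augmentation step described in the abstract can be illustrated with a minimal sketch. It assumes that labelled English tweets are machine-translated into the target language and that each translated tweet keeps its original sentiment label; the abstract does not spell out these details, so this is one plausible reading and not the authors' implementation. The function augment_with_translation, the toy_translate stand-in, and the example tweets are all hypothetical; a real setup would call an actual machine-translation system in place of the toy dictionary.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (tweet text, sentiment label)


def augment_with_translation(
    target_data: List[Example],
    english_data: List[Example],
    translate: Callable[[str], str],
) -> List[Example]:
    """Return the target-language data followed by translated English examples.

    Labels are copied unchanged, on the assumption that sentiment
    survives translation.
    """
    translated = [(translate(text), label) for text, label in english_data]
    return list(target_data) + translated


# Hypothetical stand-in for a real machine-translation system.
TOY_EN_TO_FR = {
    "i love this phone": "j'adore ce téléphone",
    "worst service ever": "pire service de tous les temps",
}


def toy_translate(text: str) -> str:
    # Fall back to the original text when no translation is known.
    return TOY_EN_TO_FR.get(text.lower(), text)


# Small annotated corpus in the target language (French here).
french_small = [("quelle belle journée", "positive")]

# Larger annotated English corpus used as the augmentation source.
english_large = [
    ("I love this phone", "positive"),
    ("Worst service ever", "negative"),
]

augmented = augment_with_translation(french_small, english_large, toy_translate)
for text, label in augmented:
    print(label, "|", text)
```

The augmented list would then serve as the fine-tuning set for the multilingual model. Passing the translator in as a function keeps the sketch independent of any particular translation service.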
