Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers
Frederico Dias Souza, João Baptista de Oliveira e Souza Filho

TL;DR
This paper conducts a comprehensive experimental comparison of embedding methods, from classical to transformer-based, for Brazilian Portuguese text classification, focusing on user reviews and sentiment analysis.
Contribution
It provides the first extensive evaluation of embedding techniques for Brazilian Portuguese reviews, highlighting the effectiveness of fine-tuned transformer models.
Findings
Fine-tuned Transformer-based Language Models (TLMs) achieved the best results across datasets.
Transformer-based models outperform classical approaches.
Open datasets and reproducibility are emphasized.
Abstract
Text classification is a natural language processing (NLP) task relevant to many commercial applications, like e-commerce and customer service. Naturally, classifying such excerpts accurately often represents a challenge, due to intrinsic language aspects, like irony and nuance. To accomplish this task, one must provide a robust numerical representation for documents, a process known as embedding. Embedding represents a key NLP field nowadays, having faced a significant advance in the last decade, especially after the introduction of the word-to-vector concept and the popularization of Deep Learning models for solving NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite the impressive achievements in this field, the literature coverage regarding generating embeddings for Brazilian Portuguese…
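As a rough illustration of what "embedding" means at the classical end of the spectrum the abstract mentions, the sketch below builds a bag-of-words count vector. It is not the authors' pipeline: the function names, the whitespace tokenization, and the three sample reviews are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return a count vector with one entry per vocabulary token."""
    counts = Counter(document.lower().split())
    # Dicts preserve insertion order, so columns follow the sorted vocabulary.
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical reviews, invented for this example (not taken from the paper's datasets).
reviews = [
    "produto muito bom",
    "produto muito ruim",
    "entrega lenta e produto ruim",
]

vocabulary = build_vocabulary(reviews)
vectors = [bag_of_words(review, vocabulary) for review in reviews]

for review, vector in zip(reviews, vectors):
    print(review, vector)
```

Each review becomes a fixed-length vector that a downstream classifier can consume. Word order and context are discarded, which is one reason word-vector and transformer-based representations can capture more of a review's meaning.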
Taxonomy
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
