Enhancing Portuguese Variety Identification with Cross-Domain Approaches
Hugo Sousa, Rúben Almeida, Purificação Silvano, Inês Cantante, Ricardo Campos, Alípio Jorge

TL;DR
This paper introduces a cross-domain language variety identifier for Portuguese, addressing linguistic biases in NLP models by creating a new dataset and evaluating transformer-based classifiers to distinguish European and Brazilian Portuguese.
Contribution
The authors developed the PtBrVarId corpus and demonstrated the effectiveness of transformer models for cross-domain Portuguese variety identification, promoting resource creation for underrepresented language varieties.
Findings
Transformer classifiers effectively distinguish Portuguese varieties across domains.
The PtBrVarId corpus enables better cross-domain variety identification.
Open-sourced resources facilitate further research in language variety classification.
Abstract
Recent advances in natural language processing have raised expectations for generative models to produce coherent text across diverse language varieties. In the particular case of the Portuguese language, the predominance of Brazilian Portuguese corpora online introduces linguistic biases in these models, limiting their applicability outside of Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and studied the effectiveness of transformer-based LVI classifiers for cross-domain scenarios. Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages. We open…
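To make the task concrete, the following is a minimal sketch of binary variety identification framed as text classification. It is not the authors' method: the paper studies transformer-based classifiers, whereas this sketch substitutes a character-trigram Naive Bayes baseline so that it runs with the standard library alone. The sentences in `TOY_DATA` are hand-written illustrations, not samples from the PtBrVarId corpus, and the names `char_ngrams`, `train`, and `predict` are hypothetical.

```python
import math
from collections import Counter

# Hand-written toy examples (an assumption for illustration only;
# these are NOT taken from the PtBrVarId corpus).
TOY_DATA = [
    ("Vou apanhar o autocarro para o trabalho.", "pt-PT"),
    ("O meu telemóvel está a carregar.", "pt-PT"),
    ("Estamos a ver o comboio a chegar.", "pt-PT"),
    ("Ele tomou o pequeno-almoço no café.", "pt-PT"),
    ("Vou pegar o ônibus para o trabalho.", "pt-BR"),
    ("Meu celular está carregando.", "pt-BR"),
    ("Estamos vendo o trem chegar.", "pt-BR"),
    ("Ele tomou o café da manhã na padaria.", "pt-BR"),
]


def char_ngrams(text, n=3):
    """Return the lowercase character n-grams of a space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(examples, n=3):
    """Count n-grams per label; return the counts and the vocabulary size."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(char_ngrams(text, n))
    vocab = set().union(*counts.values())
    return counts, len(vocab)


def predict(model, text, n=3):
    """Pick the label with the highest add-one-smoothed log-likelihood.

    Class priors are treated as uniform, so they are omitted.
    """
    counts, vocab_size = model
    scores = {}
    for label, label_counts in counts.items():
        total = sum(label_counts.values())
        scores[label] = sum(
            math.log((label_counts[gram] + 1) / (total + vocab_size))
            for gram in char_ngrams(text, n)
        )
    return max(scores, key=scores.get)


model = train(TOY_DATA)
print(predict(model, "Vou apanhar o autocarro para o trabalho."))
```

A baseline like this leans on surface cues such as vocabulary (autocarro versus ônibus) and verb constructions (está a carregar versus está carregando). Cues of that kind may not transfer between domains, which is one plausible reason a cross-domain evaluation like the one described in the abstract is needed.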
