Assessing Linguistic Generalisation in Language Models: A Dataset for Brazilian Portuguese
Rodrigo Wilkens, Leonardo Zilio, Aline Villavicencio

TL;DR
This paper introduces a new dataset and intrinsic evaluation tasks to assess how well language models for Brazilian Portuguese capture linguistic phenomena like grammatical structures and multiword expressions, aiding transparency and comparability.
Contribution
It presents a novel dataset and evaluation framework specifically designed to analyze linguistic generalisation in Brazilian Portuguese language models.
Findings
BERTimbau Large outperforms BERTimbau Base and mBERT on the multiword expression (MWE) tasks.
The dataset effectively distinguishes models' capabilities in capturing linguistic phenomena.
Evaluation reveals strengths and limitations of current models in encoding grammatical structures.
Abstract
Much recent effort has been devoted to creating large-scale language models. Nowadays, the most prominent approaches are based on deep neural networks, such as BERT. However, they lack transparency and interpretability, and are often seen as black boxes. This affects not only their applicability in downstream tasks but also the comparability of different architectures or even of the same model trained using different corpora or hyperparameters. In this paper, we propose a set of intrinsic evaluation tasks that inspect the linguistic information encoded in models developed for Brazilian Portuguese. These tasks are designed to evaluate how different language models generalise information related to grammatical structures and multiword expressions (MWEs), thus allowing for an assessment of whether the model has learned different linguistic phenomena. The dataset that was developed for…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Attention Is All You Need · WordPiece · Softmax · Layer Normalization · Dropout · Linear Layer · Attention Dropout · Multi-Head Attention
