Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling

Elena Álvarez-Mellado; Constantine Lignos

arXiv:2203.16169 · cs.CL · March 31, 2022

TL;DR

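This paper introduces a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings and uses it to evaluate several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models). A BiLSTM-CRF fed with subword embeddings plus either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms a multilingual BERT-based model.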

Contribution

It provides a 370,000-token annotated corpus for borrowing detection and compares several sequence labeling models on it, analyzing their performance and errors in identifying unassimilated lexical borrowings in Spanish.

Findings

01

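A BiLSTM-CRF fed with subword embeddings plus additional embeddings (Transformer-based or contextualized word embeddings) outperforms a multilingual BERT-based model.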

02

Transformer-based embeddings pretrained on codeswitched data are one of the two embedding setups with which the BiLSTM-CRF beats multilingual BERT.

03

The corpus (370,000 tokens) is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task.

Abstract

This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings -- words from one language that are introduced into another without orthographic adaptation -- and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.
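The abstract frames borrowing detection as sequence labeling, where each token gets a tag and a borrowing may span several tokens. The following is a minimal sketch of one common way to encode such spans, BIO tagging, and is not taken from the paper's code. The function names, the `ENG` label, and the example sentence are illustrative assumptions.

```python
from typing import List, Tuple

# (start, end_exclusive, label); the label "ENG" used below is an assumption
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping token spans as BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the borrowing
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode BIO tags back into spans; stray I- tags without a B- are ignored."""
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        if start is not None and tag != f"I-{label}":
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Hypothetical example sentence with two unassimilated borrowings.
tokens = "El look incluye un animal print muy llamativo".split()
gold_spans = [(1, 2, "ENG"), (4, 6, "ENG")]
tags = spans_to_bio(len(tokens), gold_spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A CRF, BiLSTM-CRF, or Transformer tagger would be trained to predict the tag column from the token column; decoding the predicted tags back into spans allows evaluation at the level of whole borrowings rather than single tokens.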

Code & Models

Repositories

lirondos/coalas (Official)

Taxonomy

Topics: Linguistics, Language Diversity, and Identity · Natural Language Processing Techniques · Text Readability and Simplification