Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information
Maria Clara Ramos Morales Crespo, Maria Lina de Souza Jeannine Rocha, Mariana Lourenço Sturzeneker, Felipe Ribas Serras, Guilherme Lamartine de Mello, Aline Silva Costa, Mayara Feliciano Palma, Renata Morais Mesquita, Raquel de Paula Guets, Mariana Marques da Silva

TL;DR
This paper introduces Carolina, a comprehensive and annotated corpus of Brazilian Portuguese designed for linguistic and computational research, emphasizing provenance, typology, and versioning to support language modeling and resource development.
Contribution
It presents the first public version of Carolina, detailing its construction methodology, metadata standards, and potential for advancing Portuguese language processing.
Findings
The corpus contains over 653 million tokens.
Each text carries header metadata developed using TEI standards.
The corpus supports linguistic and NLP research.
Abstract
This paper presents the first publicly available version of the Carolina Corpus and discusses its future directions. Carolina is a large open corpus of Brazilian Portuguese texts under construction using web-as-corpus methodology enhanced with provenance, typology, versioning, and text integrality. The corpus is intended to serve both as a reliable source for research in Linguistics and as an important resource for Computer Science research on language models, contributing towards removing Portuguese from the set of low-resource languages. Here we present the corpus construction methodology, comparing it with other existing methodologies, as well as the corpus's current state: Carolina's first public version has tokens, distributed over broad types. Each text is annotated with several different metadata categories in its header, which we developed using TEI…
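As a rough illustration of what header-level metadata in a TEI-encoded text looks like in practice, the sketch below parses a small TEI document with Python's standard library and pulls a title and a source reference out of its header. The sample document, the `read_header` helper, and the choice of fields are hypothetical: the elements used (`teiHeader`, `fileDesc`, `titleStmt`, `sourceDesc`) are generic TEI, not Carolina's actual header schema, which this page does not show. The whitespace token count is likewise only a stand-in and is not the counting method used for the corpus.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample; NOT taken from the Carolina corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Exemplo de texto</title></titleStmt>
      <publicationStmt><p>Hypothetical sample</p></publicationStmt>
      <sourceDesc><bibl>https://example.org/texto</bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Olá, mundo.</p></body></text>
</TEI>"""


def read_header(xml_string):
    """Return a few header fields plus a naive whitespace token count."""
    root = ET.fromstring(xml_string)
    title = root.findtext(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
        namespaces=TEI_NS,
    )
    source = root.findtext(
        "tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:bibl",
        namespaces=TEI_NS,
    )
    # Join the text of all body paragraphs and split on whitespace.
    body = " ".join(
        p.text or "" for p in root.iterfind("tei:text/tei:body/tei:p", TEI_NS)
    )
    return {"title": title, "source": source, "n_tokens": len(body.split())}


info = read_header(SAMPLE)
print(info)
```

Keeping provenance in the header, separate from the body text, means the same file can be filtered by source or type without touching the text itself.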
Taxonomy
Topics: Natural Language Processing Techniques
