Compiling and Processing Historical and Contemporary Portuguese Corpora

Marcos Zampieri

arXiv:1710.00803 · cs.CL · October 3, 2017 · 2 cites

Open Access

TL;DR

This technical report details the framework used to process three large Portuguese corpora: two contemporary newspaper corpora, one from Brazil and one from Portugal, and Colonia, a historical collection of texts written between the 16th and the early 20th century. It covers pre-processing, segmentation, annotation, indexing, and querying.

Contribution

The report documents the framework used to process three Portuguese corpora, including the historical Colonia collection, and lists published research that uses them.

Findings

01. Pre-processing, segmentation, and annotation methods for Portuguese corpora

02. Indexing and querying methods for large corpora

03. An overview of published research papers that use the corpora

Abstract

This technical report describes the framework used for processing three large Portuguese corpora. Two corpora contain texts from newspapers, one published in Brazil and the other published in Portugal. The third corpus is Colonia, a historical Portuguese collection containing texts written between the 16th and the early 20th century. The report presents pre-processing methods, segmentation, and annotation of the corpora as well as indexing and querying methods. Finally, it presents published research papers using the corpora.

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling