Recognizing the vocabulary of Brazilian popular newspapers with a free-access computational dictionary
Maria José Finatto (UFRGS), Oto Vale (UFSCar), Eric Laporte (LIGM)

TL;DR
This study evaluates the effectiveness of two versions of a free computational dictionary in recognizing vocabulary from Brazilian newspapers, revealing coverage gaps and suggesting improvements for linguistic analysis.
Contribution
It provides a critical comparison of DELAF PB 2004 and 2015 in recognizing newspaper vocabulary, highlighting coverage limitations and proposing methods for enhancement.
Findings
Approximately 19% of the word types in DG are not found in either DELAF PB 2004 or DELAF PB 2015.
In MA, the proportion of types not found is about 13%.
Switching between the two dictionary versions affects recognition performance only slightly.
Abstract
We report an experiment to check the identification of a set of words in popular written Portuguese with two versions of a computational dictionary of Brazilian Portuguese, DELAF PB 2004 and DELAF PB 2015. This dictionary is freely available for use in linguistic analyses of Brazilian Portuguese and in other research, which justifies a critical study. The vocabulary comes from the PorPopular corpus, made up of the popular newspapers Diário Gaúcho (DG) and Massa! (MA). From DG, we retained a set of texts with 984,465 words (tokens), published in 2008, with the spelling used before the Portuguese Language Orthographic Agreement adopted in 2009. From MA, we examined issues from 2012, 2014, and 2015, with 215,776 words (tokens), all with the new spelling. The checking involved: a) generating lists of words (types) occurring in DG and MA; b) comparing them with the entry lists of both versions of…
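Steps (a) and (b) of the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the tokenization rule, the function names, and the toy dictionary entries (including their grammatical codes) are assumptions made for illustration, and the line layout `form,lemma.CODE` is assumed to follow the usual DELAF convention of putting the inflected form before the first comma.

```python
import re

def word_types(text):
    """Step (a): return the set of lowercase word types found in a text.

    Assumption: a word is a run of letters, optionally joined by hyphens.
    """
    return set(re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower()))

def delaf_forms(lines):
    """Collect the inflected forms from DELAF-style lines.

    Assumption: each line looks like 'form,lemma.CODE', so the form is
    everything before the first comma.
    """
    forms = set()
    for line in lines:
        line = line.strip()
        if line:
            forms.add(line.split(",", 1)[0].lower())
    return forms

def unrecognized(types, forms):
    """Step (b): list the types that have no entry in the dictionary."""
    return sorted(types - forms)

# Toy data, invented for this sketch (not taken from DG, MA, or DELAF PB).
sample_text = "O guri comprou abacaxis e bergamotas na feira."
sample_delaf = [
    "o,o.DET:ms",
    "comprou,comprar.V:J3s",
    "abacaxis,abacaxi.N:mp",
    "e,e.CONJ",
    "na,em.PREP:fs",
    "feira,feira.N:fs",
]

types = word_types(sample_text)
missing = unrecognized(types, delaf_forms(sample_delaf))
rate = len(missing) / len(types)

print(missing)  # types with no dictionary entry
print(rate)     # share of types not recognized
```

Running the same comparison once per dictionary version would give the kind of per-version figures reported in the findings; on real data, the tokenization and case-folding choices would need to match those used to build the type lists.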
