Automatic Extraction of the Romanian Academic Word List: Data and Methods
Ana-Maria Bucur, Andreea Dincă, Mădălina Chitez, Roxana Rogobete

TL;DR
This paper details the development of the first Romanian Academic Word List (Ro-AWL) using corpus linguistics and computational methods, providing a resource for education and NLP in Romanian.
Contribution
It introduces a novel methodology for creating an academic word list for Romanian, combining existing and new data sources, filling a significant resource gap.
Findings
Ro-AWL distribution aligns with previous research patterns
The list is freely available for multiple applications
Methodology can be adapted for other languages
Abstract
This paper presents the methodology and data used for the automatic extraction of the Romanian Academic Word List (Ro-AWL). Academic Word Lists are useful in both L2 and L1 teaching contexts. For the Romanian language, no such resource exists so far. Ro-AWL has been generated by combining methods from corpus and computational linguistics with L2 academic writing approaches. We use two types of data: (a) existing data, such as the Romanian Frequency List based on the ROMBAC corpus, and (b) self-compiled data, such as the expert academic writing corpus EXPRES. For constructing the academic word list, we follow the methodology for building the Academic Vocabulary List for the English language. The distribution of Ro-AWL features (general distribution, POS distribution) into four disciplinary datasets is in line with previous research. Ro-AWL is freely available and can be used for…
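The abstract says the list follows the methodology of the Academic Vocabulary List for English. As a rough illustration only, the sketch below shows two filters commonly associated with that style of approach: a frequency ratio between an academic and a general corpus, and a range check across disciplines. The thresholds, the discipline labels (d1 to d4), the function and field names, and the toy frequencies are all assumptions made for this example; none of them are taken from the paper or from the Ro-AWL data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WordStats:
    lemma: str
    academic_pm: float               # frequency per million in the academic corpus
    general_pm: float                # frequency per million in the general corpus
    per_discipline: Dict[str, float]  # per-million frequency in each discipline


def passes_ratio(w: WordStats, min_ratio: float) -> bool:
    """Keep words that are clearly more frequent in academic text."""
    if w.general_pm == 0:
        return w.academic_pm > 0
    return w.academic_pm / w.general_pm >= min_ratio


def passes_range(w: WordStats, min_share: float, min_disciplines: int) -> bool:
    """Keep words that are reasonably frequent in enough disciplines.

    A discipline counts if the word reaches at least `min_share` of its
    overall academic frequency there.
    """
    threshold = min_share * w.academic_pm
    covered = sum(1 for f in w.per_discipline.values() if f >= threshold)
    return covered >= min_disciplines


def extract_candidates(
    words: List[WordStats],
    min_ratio: float = 1.5,      # illustrative threshold (assumption)
    min_share: float = 0.2,      # illustrative threshold (assumption)
    min_disciplines: int = 3,    # illustrative threshold (assumption)
) -> List[str]:
    return [
        w.lemma
        for w in words
        if passes_ratio(w, min_ratio)
        and passes_range(w, min_share, min_disciplines)
    ]


# Toy data, invented for the example (not real corpus counts).
toy = [
    WordStats("proces", 400, 100, {"d1": 380, "d2": 420, "d3": 390, "d4": 410}),
    WordStats("copil", 50, 300, {"d1": 60, "d2": 40, "d3": 55, "d4": 45}),
    WordStats("izotop", 200, 10, {"d1": 790, "d2": 5, "d3": 3, "d4": 2}),
]

print(extract_candidates(toy))
```

In this toy run, "copil" is rejected by the ratio filter because it is more frequent in general text, and "izotop" is rejected by the range filter because it is concentrated in a single discipline. A fuller pipeline would typically add a dispersion measure as well.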
Taxonomy
Topics: Natural Language Processing Techniques · Lexicography and Language Studies · Second Language Acquisition and Learning
