A Distributed Automatic Domain-Specific Multi-Word Term Recognition Architecture using Spark Ecosystem
Ciprian-Octavian Truică, Neculai-Ovidiu Istrate, and Elena-Simona Apostol

TL;DR
This paper introduces a scalable Spark-based architecture for automatic domain-specific multi-word term recognition, improving text preprocessing and candidate extraction for large textual datasets in NLP.
Contribution
It presents a novel distributed architecture built on Spark, with an analysis of its accuracy and scalability, and provides an easy-to-integrate Python implementation.
Findings
Proven feasibility through experiments on real-world datasets
Demonstrated improved scalability and accuracy
Enabled Big Data processing for NLP tasks
Abstract
Automatic Term Recognition extracts the terms that are specific to a given domain. To be accurate, these corpus- and language-dependent methods require large volumes of textual data, which must be processed to extract candidate terms that are afterward scored according to a given metric. To improve text preprocessing and candidate term extraction and scoring, we propose a distributed Spark-based architecture that automatically extracts domain-specific terms. The main contributions are as follows: (1) we propose a novel distributed automatic domain-specific multi-word term recognition architecture built on top of the Spark ecosystem; (2) we perform an in-depth analysis of our architecture in terms of accuracy and scalability; (3) we design an easy-to-integrate Python implementation that enables the use of Big Data processing in fields such as Computational Linguistics and…
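The abstract describes a pipeline of preprocessing, candidate extraction, and scoring. The following is a minimal single-machine sketch of that flow in plain Python, not the authors' implementation. The stop-word list, the function names (`tokenize`, `candidates`, `score_terms`), and the toy score (frequency multiplied by term length) are all assumptions made for illustration; the paper's actual metric is not given in this summary. In a Spark setting, the per-document candidate generation would roughly correspond to a `flatMap` step and the counting to a `reduceByKey` step.

```python
import re
from collections import Counter
from itertools import chain

# Hypothetical stop-word list; a real pipeline would use a language-specific one.
STOP_WORDS = {"the", "a", "an", "of", "is", "are", "to", "in", "and", "for", "on"}


def tokenize(doc):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", doc.lower())


def candidates(tokens, max_len=3):
    """Yield n-grams of 2..max_len tokens that contain no stop words."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(t in STOP_WORDS for t in gram):
                yield " ".join(gram)


def score_terms(corpus, max_len=3):
    """Count candidates over the corpus and weight each by its length in words.

    The score (frequency * number of words) is a toy stand-in for a real
    term-recognition metric.
    """
    counts = Counter(
        chain.from_iterable(candidates(tokenize(d), max_len) for d in corpus)
    )
    return {term: freq * len(term.split()) for term, freq in counts.items()}


corpus = [
    "Automatic term recognition extracts domain terms.",
    "Term recognition is used for automatic term recognition in text.",
]
scores = score_terms(corpus)
# Rank by descending score, breaking ties alphabetically.
top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
```

Keeping candidate generation per document and aggregation as a separate step is what would make such a pipeline straightforward to distribute: documents can be processed independently, and only the candidate counts need to be combined.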
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Natural Language Processing Techniques
