Natural Language Processing using Hadoop and KOSHIK
Emre Erturk, Hong Shi

TL;DR
This paper discusses building a natural language processing platform using Hadoop and KOSHIK, detailing its architecture and data analysis steps and evaluating its performance and limitations.
Contribution
It presents a comprehensive guide to constructing a KOSHIK NLP platform with Hadoop, integrating tools like Stanford CoreNLP and OpenNLP, and evaluates its effectiveness.
Findings
KOSHIK effectively processes large-scale wiki data.
The architecture has notable advantages in scalability.
Performance improvements are recommended for better efficiency.
Abstract
Natural language processing, a data analytics technology, is widely used in research areas such as artificial intelligence, human language processing, and translation. The explosive growth of data now poses many challenges for natural language processing. Hadoop is one of the platforms that can process the large amounts of data that natural language processing requires. KOSHIK is a natural language processing architecture that builds on Hadoop and includes language processing components such as Stanford CoreNLP and OpenNLP. This study describes how to build a KOSHIK platform with the relevant tools and provides the steps to analyze wiki data. Finally, it evaluates and discusses the advantages and disadvantages of the KOSHIK architecture and gives recommendations for improving processing performance.
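To make the Hadoop side of this concrete, the following is a minimal sketch of the map, shuffle, and reduce pattern applied to wiki text, simulated in plain Python. It is not KOSHIK's code and does not call Hadoop, Stanford CoreNLP, or OpenNLP: the regex tokenizer is a stand-in for those components, and the names `tokenize`, `map_page`, `shuffle`, `reduce_counts`, and `run_job`, as well as the sample pages, are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Assumption: a simple regex tokenizer stands in for a real NLP component
# such as Stanford CoreNLP or OpenNLP, which the paper uses instead.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def tokenize(text):
    """Split text into lowercase word tokens (illustrative stand-in)."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def map_page(page_id, text):
    """Map step: emit a (token, 1) pair for every token in one wiki page."""
    for token in tokenize(text):
        yield token, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(token, counts):
    """Reduce step: sum the counts emitted for one token."""
    return token, sum(counts)


def run_job(pages):
    """Run map, shuffle, and reduce in-process over a dict of page_id -> text."""
    mapped = (
        pair
        for page_id, text in pages.items()
        for pair in map_page(page_id, text)
    )
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical sample pages, not data from the paper.
    sample_pages = {
        "p1": "Hadoop stores data",
        "p2": "Hadoop processes data",
    }
    for token, count in sorted(run_job(sample_pages).items()):
        print(token, count)
```

On a real cluster the map and reduce functions would run on separate nodes over partitions of the wiki dump, which is where the scalability discussed in the paper comes from; the in-process version here only shows the shape of the computation.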
Taxonomy
Topics: Scientific Computing and Data Management · Cloud Computing and Resource Management · Semantic Web and Ontologies
