Data-to-Value: An Evaluation-First Methodology for Natural Language Projects

Jochen L. Leidner

arXiv:2201.07725·cs.CL·January 20, 2022

Open Access

TL;DR

The paper introduces 'Data to Value' (D2V), a new evaluation-first methodology tailored for large-scale natural language processing projects, addressing scalability, unstructured data, and non-technical factors.

Contribution

It presents a novel methodology specifically designed for big data NLP projects, filling gaps left by traditional data mining methodologies.

Findings

01

D2V is designed to help large-scale NLP projects succeed more consistently.

02

The methodology incorporates a comprehensive question catalog for better project guidance.

03

It covers non-technical project aspects (e.g. legal, ethical, project management) alongside the technical ones.

Abstract

Big data, i.e. collecting, storing and processing of data at scale, has recently become possible due to the arrival of clusters of commodity computers powered by application-level distributed parallel operating systems like HDFS/Hadoop/Spark, and such infrastructures have revolutionized data mining at scale. For data mining projects to succeed more consistently, some methodologies were developed (e.g. CRISP-DM, SEMMA, KDD), but these do not account for (1) very large scales of processing, (2) dealing with textual (unstructured) data, i.e. Natural Language Processing (NLP, "text analytics"), and (3) non-technical considerations (e.g. legal, ethical, project managerial aspects). To address these shortcomings, a new methodology, called "Data to Value" (D2V), is introduced, which is guided by a detailed catalog of questions in order to avoid a disconnect of big data text analytics project…

Taxonomy

Topics: Semantic Web and Ontologies · Data Mining Algorithms and Applications · Big Data and Business Intelligence