Blocks Architecture (BloArk): Efficient, Cost-Effective, and Incremental Dataset Architecture for Wikipedia Revision History

Lingxi Li; Zonghai Yao; Sunjae Kwon; Hong Yu

arXiv:2410.04410·cs.CL·October 8, 2024


TL;DR

BloArk is a new architecture designed to efficiently process and incrementally update Wikipedia revision history datasets, reducing computational costs and enabling scalable NLP data preparation.

Contribution

The paper introduces a cost-effective, incremental dataset architecture for processing Wikipedia revision history, aimed at improving efficiency and scalability.

Findings

01. Reduces processing time for Wikipedia revision datasets
02. Enables incremental updates to existing datasets
03. Open-source implementation available online
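To make the idea of incremental updates concrete, here is a minimal sketch of one way such a workflow could look. It is not BloArk's actual API: the function `process_incrementally`, the JSON manifest file, and the `block_*.jsonl` file names are all hypothetical, chosen only for illustration. The sketch records which block files have already been processed and skips them on later runs, so only newly added blocks cost any compute.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def process_incrementally(
    block_paths: Iterable[Path],
    manifest_path: Path,
    process_block: Callable[[Path], None],
) -> List[str]:
    """Run process_block only on blocks not yet listed in the manifest.

    Hypothetical helper for illustration; not part of BloArk.
    Returns the names of the blocks processed during this call.
    """
    # Load the names of blocks handled by earlier runs, if any.
    done = set()
    if manifest_path.exists():
        done = set(json.loads(manifest_path.read_text()))

    newly_processed = []
    for path in sorted(block_paths):
        if path.name in done:
            continue  # already processed in a previous run
        process_block(path)
        done.add(path.name)
        newly_processed.append(path.name)

    # Persist the updated manifest so the next run can skip these blocks.
    manifest_path.write_text(json.dumps(sorted(done)))
    return newly_processed
```

Calling the function a second time with the same block list processes nothing; adding one new block file causes only that block to be processed.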

Abstract

Wikipedia (Wiki) is one of the most widely used and publicly available resources for natural language processing (NLP) applications. Wikipedia Revision History (WikiRevHist) shows the order in which edits were made to any Wiki page since its first modification. While the most up-to-date Wiki has been widely used as a training source, WikiRevHist can also be a valuable resource for NLP applications. However, there are insufficient tools available to process WikiRevHist without substantial computing resources, additional customization, and extra time spent adapting others' work. Therefore, we report Blocks Architecture (BloArk), an efficiency-focused data processing architecture that reduces running time, computing resource requirements, and repeated work in processing the WikiRevHist dataset. BloArk consists of three parts in its infrastructure: blocks, segments, and…


Taxonomy

Topics: Wikis in Education and Collaboration · Natural Language Processing Techniques · Digital Rights Management and Security