Blocks Architecture (BloArk): Efficient, Cost-Effective, and Incremental Dataset Architecture for Wikipedia Revision History
Lingxi Li, Zonghai Yao, Sunjae Kwon, Hong Yu

TL;DR
BloArk is a new architecture designed to efficiently process and incrementally update Wikipedia revision history datasets, reducing computational costs and enabling scalable NLP data preparation.
Contribution
It introduces a novel, cost-effective, and incremental dataset architecture for Wikipedia revision history processing, improving efficiency and scalability.
Findings
Reduces processing time for Wikipedia revision datasets
Enables incremental updates to existing datasets
Open-source implementation available online
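To make the incremental-update finding concrete, here is a minimal sketch in Python. It is not BloArk's API or the authors' implementation: the `Revision` and `RevisionStore` names, the character-count "processing" step, and the assumption that revision IDs increase over time within a page are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Revision:
    # Hypothetical minimal record for one edit of one page.
    page_id: int
    rev_id: int
    text: str


@dataclass
class RevisionStore:
    # Highest revision id already processed for each page.
    last_seen: dict = field(default_factory=dict)
    # Processed output; here just the character count of each revision.
    processed: dict = field(default_factory=dict)

    def update(self, revisions):
        """Process only revisions newer than those already stored.

        Returns the number of revisions that were newly processed.
        Assumes revision ids grow over time within a page.
        """
        new_count = 0
        for rev in sorted(revisions, key=lambda r: (r.page_id, r.rev_id)):
            if rev.rev_id <= self.last_seen.get(rev.page_id, -1):
                continue  # already handled in an earlier run
            self.processed[(rev.page_id, rev.rev_id)] = len(rev.text)
            self.last_seen[rev.page_id] = rev.rev_id
            new_count += 1
        return new_count


store = RevisionStore()
first_dump = [Revision(1, 10, "a"), Revision(1, 11, "ab")]
first_run = store.update(first_dump)

# A later dump repeats the old revisions and adds two new ones.
later_dump = first_dump + [Revision(1, 12, "abc"), Revision(2, 5, "x")]
second_run = store.update(later_dump)
```

In this sketch the second call skips the two revisions it has already seen and handles only the new ones, which is the general idea behind updating an existing dataset instead of rebuilding it from scratch.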
Abstract
Wikipedia (Wiki) is one of the most widely used and publicly available resources for natural language processing (NLP) applications. Wikipedia Revision History (WikiRevHist) shows the order in which edits were made to any Wiki page since its first modification. While the most up-to-date Wiki has been widely used as a training source, WikiRevHist can also be a valuable resource for NLP applications. However, few tools can process WikiRevHist without substantial computing resources, additional customization, and extra time spent adapting others' work. Therefore, we report Blocks Architecture (BloArk), an efficiency-focused data processing architecture that reduces running time, computing resource requirements, and repeated work in processing the WikiRevHist dataset. BloArk consists of three parts in its infrastructure: blocks, segments, and…
Taxonomy
Topics: Wikis in Education and Collaboration · Natural Language Processing Techniques · Digital Rights Management and Security
