Optimizing a Data Science System for Text Reuse Analysis
Ananth Mahadevan, Michael Mathioudakis, Eetu Mäkelä, Mikko Tolonen

TL;DR
This paper presents ReceptionReader, a system optimized for large-scale text reuse analysis in historical corpora, demonstrating the effectiveness of different database and processing frameworks for various analysis tasks.
Contribution
The paper introduces ReceptionReader, a novel system for large-scale text reuse analysis, and provides an extensive evaluation of database and processing framework trade-offs.
Findings
MariaDB Aria is optimal for most analysis workloads.
Apache Spark is essential for all processing stages.
Efficient system design enables large-scale historical text reuse analysis.
Abstract
Text reuse is a methodological element of fundamental importance in humanities research: pieces of text that re-appear across different documents, verbatim or paraphrased, provide invaluable information about the historical spread and evolution of ideas. Large modern digitized corpora enable the joint analysis of text collections that span entire centuries and the detection of large-scale patterns, impossible to detect with traditional small-scale analysis. For this opportunity to materialize, it is necessary to develop efficient data science systems that perform the corresponding analysis tasks. In this paper, we share insights from ReceptionReader, a system for analyzing text reuse in large historical corpora. The system is built upon billions of instances of text reuses from large digitized corpora of 18th-century texts. Its main functionality is to perform downstream text reuse…
Taxonomy
Topics: Advanced Database Systems and Queries · Semantic Web and Ontologies · Data Mining Algorithms and Applications
