A Scalable Document-based Architecture for Text Analysis
Ciprian-Octavian Truică, Jérôme Darmont (ERIC), Julien Velcin (ERIC)

TL;DR
This paper introduces a scalable, flexible document-based architecture for text analysis that integrates multiple preprocessing techniques and improves performance through efficient indexing, demonstrated with relational and document-oriented databases.
Contribution
A novel generic text analysis architecture combining flexible document structure, multiple preprocessing steps, and efficient indexing, implemented with relational and document-oriented databases.
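As a rough illustration of how these three ingredients fit together, the sketch below stores documents with varying fields, runs one simple preprocessing step, and builds an inverted index for lookup. It is not the authors' implementation: the field names, the stopword list, and the helper functions `preprocess`, `build_index`, and `search` are all hypothetical, and plain in-memory Python dictionaries stand in for the relational and document-oriented databases named above.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stopword list; a real pipeline would use a fuller one
# and add steps such as stemming or lemmatization.
STOPWORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index(documents):
    """Attach tokens to each document and return a term -> document-id index."""
    index = defaultdict(set)
    for doc in documents:
        doc["tokens"] = preprocess(doc["text"])
        for token in doc["tokens"]:
            index[token].add(doc["id"])
    return index


def search(index, term):
    """Return the sorted ids of documents containing the term."""
    return sorted(index.get(term.lower(), set()))


# Flexible structure: documents share "id" and "text" but may carry
# different optional fields (assumed names, for illustration only).
documents = [
    {"id": 1, "text": "Text analysis of the daily news", "author": "alice"},
    {"id": 2, "text": "Scalable indexing in document databases", "tags": ["db"]},
    {"id": 3, "text": "Document analysis and indexing"},
]

index = build_index(documents)
print(search(index, "analysis"))
print(search(index, "indexing"))
```

The index lets a term lookup touch only the matching document ids instead of scanning every text, which is the general idea behind indexing for efficient access.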
Findings
Document-oriented implementation outperforms relational databases in scalability.
Flexible architecture supports various preprocessing techniques.
Feasibility demonstrated through experimental evaluation.
Abstract
Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps (e.g., stem or lemma extraction, part-of-speech tagging, named entity recognition), and performance and scaling issues. Existing text analysis architectures partly solve these issues, providing restrictive data schemas, addressing only one aspect of text preprocessing and focusing on one single task when dealing with performance optimization. As a result, no definite solution is currently available. Thus, we propose in this paper a new generic text analysis architecture, where document structure is flexible, many preprocessing techniques are integrated and textual datasets are indexed for efficient access. We implement our conceptual architecture using…
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Text Analysis Techniques · Text and Document Classification Technologies
