Contextualization for the Organization of Text Documents Streams
Rui Portocarrero Sarmento, Douglas O. Cardoso, João Gama, Pavel Brazdil

TL;DR
This paper explores dynamic algorithms such as Incremental TextRank and IS-TFIDF, combined with FastText embeddings, to efficiently organize and analyze streams of text documents in real time, demonstrated on Reuters and COVID-19 datasets.
Contribution
It introduces a novel architecture for streaming text document organization using incremental algorithms and embedding-based similarity, improving processing speed and contextual understanding.
Findings
Incremental algorithms outperform batch methods in streaming contexts.
FastText embeddings enhance document similarity assessments.
The approach effectively clusters large, evolving text datasets.
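The embedding-based similarity mentioned in the findings can be illustrated with a minimal sketch. It assumes each document is represented by the mean of its word vectors and compared by cosine similarity; this is a common convention, not necessarily the exact procedure used in the paper. The names `mean_vector`, `cosine`, and the toy `word_vectors` table are hypothetical stand-ins; a real FastText model would supply the vectors and would also cover out-of-vocabulary words through subword n-grams, which this toy lookup does not.

```python
import math

def mean_vector(tokens, word_vectors):
    """Average the vectors of the tokens found in word_vectors.

    Returns None when no token has a vector (a real FastText model
    would not hit this case, since it builds vectors from subwords).
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 2-dimensional vectors, invented purely for illustration.
word_vectors = {
    "oil": [1.0, 0.0],
    "price": [0.8, 0.2],
    "virus": [0.0, 1.0],
}

doc_a = mean_vector(["oil", "price"], word_vectors)
doc_b = mean_vector(["virus", "price"], word_vectors)
similarity = cosine(doc_a, doc_b)
```

Because each document collapses to a single fixed-length vector, a new document arriving in the stream can be compared with earlier ones without recomputing anything for the documents already seen.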
Abstract
The research community has made a significant effort to provide methods for organizing documentation with the help of information retrieval methods. In this report, we present several experiments with stream analysis methods to explore streams of text documents. We use only dynamic algorithms to explore, analyze, and organize the flux of text documents. This document presents a case study of architectures developed for Text Document Stream Organization, using incremental algorithms such as Incremental TextRank and IS-TFIDF. Both algorithms rest on the assumption that mapping text documents and their document-term matrix onto lower-dimensional evolving networks provides faster processing than batch algorithms. With this architecture, and by using FastText embeddings to retrieve similarity between documents, we compare…
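To make the contrast between incremental and batch processing concrete, here is a minimal sketch of a generic incremental TF-IDF. It is not the authors' IS-TFIDF, whose details are not given in this summary; the class `IncrementalTfidf`, its method names, and the plain `log(N / df)` weighting are assumptions chosen for illustration. The point is only that document frequencies are updated as each document arrives, so nothing is rebuilt from scratch.

```python
import math
from collections import Counter

class IncrementalTfidf:
    """Hypothetical streaming TF-IDF: statistics are updated per document."""

    def __init__(self):
        self.n_docs = 0            # number of documents seen so far
        self.doc_freq = Counter()  # term -> number of documents containing it
        self.term_counts = []      # per-document term counts

    def add(self, tokens):
        """Register one document from the stream and return its id."""
        counts = Counter(tokens)
        self.term_counts.append(counts)
        self.n_docs += 1
        self.doc_freq.update(counts.keys())  # each term counted once per document
        return len(self.term_counts) - 1

    def tfidf(self, doc_id):
        """TF-IDF weights of a stored document under the current statistics."""
        counts = self.term_counts[doc_id]
        total = sum(counts.values())
        if total == 0:
            return {}
        return {
            term: (count / total) * math.log(self.n_docs / self.doc_freq[term])
            for term, count in counts.items()
        }

model = IncrementalTfidf()
first = model.add(["oil", "price", "oil"])
second = model.add(["virus", "price"])
weights = model.tfidf(first)
```

In this sketch, weights are computed on demand from the current counts, so a term shared by every document seen so far (here `price`) receives weight zero, and earlier documents are automatically re-weighted as the stream grows.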
Taxonomy
Topics: Complex Network Analysis Techniques · Advanced Clustering Algorithms Research · Complex Systems and Time Series Analysis
Methods: fastText
