Dynamic User-Defined Similarity Searching in Semi-Structured Text Retrieval
Filippo Geraci, Marco Pellegrini

TL;DR
This paper presents a method for dynamic similarity search over semi-structured text databases, where the user sets per-field weights at query time. It embeds the weights differently from the cluster-pruning baseline of Singitham et al. (VLDB 2004) and uses an alternative clustering approach, improving both query speed and result quality.
Contribution
It proposes an alternative data embedding and clustering approach for dynamic similarity search, outperforming prior methods in speed and quality.
Findings
Significant reduction in query time compared to baseline methods.
Improved tradeoffs between query accuracy and computational efficiency.
Pre-processing time reduced by at least a factor of thirty.
Abstract
Modern text retrieval systems often provide a similarity search utility that allows the user to efficiently find a fixed number k of documents in the data set that are most similar to a given query (here a query is either a simple sequence of keywords or the identifier of a full document found in previous searches that is considered of interest). We consider the case of a textual database made of semi-structured documents. Each field, in turn, is modelled with a specific vector space. The problem is more complex when we also allow each such vector space to have an associated user-defined dynamic weight that influences its contribution to the overall dynamic aggregated and weighted similarity. This dynamic problem has been tackled in a recent paper by Singitham et al. in VLDB 2004. Their proposed solution, which we take as baseline, is a variant of the cluster-pruning technique that…
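To make the problem statement concrete, the sketch below shows one plausible reading of the "aggregated and weighted similarity": each field of a document is a sparse vector in its own space, and the overall score is a weighted sum of per-field cosine similarities, with the weights supplied at query time. It answers a top-k query by exhaustive scan, which is only the naive reference point and not the cluster-pruning baseline or the method proposed in the paper. The names `cosine`, `weighted_similarity`, and `top_k`, the choice of cosine similarity, and the field names in the usage note are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]      # sparse vector: term -> weight
Document = Dict[str, Vector]   # field name -> vector in that field's space


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    return dot / (norm_u * norm_v)


def weighted_similarity(query: Document, doc: Document,
                        weights: Dict[str, float]) -> float:
    """Weighted sum of per-field similarities (assumed aggregation rule)."""
    return sum(
        weight * cosine(query.get(field, {}), doc.get(field, {}))
        for field, weight in weights.items()
    )


def top_k(query: Document, docs: Dict[str, Document],
          weights: Dict[str, float], k: int) -> List[Tuple[float, str]]:
    """Exhaustive scan: score every document, return the k best."""
    scored = [
        (weighted_similarity(query, doc, weights), doc_id)
        for doc_id, doc in docs.items()
    ]
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]
```

With two hypothetical fields, `title` and `body`, the same query can rank documents differently as the user shifts weight from one field to the other. That dependence on query-time weights is what makes the problem dynamic: an index built for one weight setting cannot simply be reused for another.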
Taxonomy
Topics: Data Management and Algorithms · Advanced Image and Video Retrieval Techniques · Algorithms and Data Compression
