The Statistical Dictionary-based String Matching Problem
M. Suri, S. Rini

TL;DR
This paper introduces a statistical and information-theoretic framework for the Dictionary-based String Matching problem, analyzing retrieval efficiency when using k-grams and prefix-free codes.
Contribution
It formulates the DSM problem in a statistical context, enabling analysis of retrieval performance with coded strings and statistical descriptions of source and query.
Findings
Defined retrieval efficiency as average query cost for long sequences.
Analyzed performance for k-grams and prefix-free codes.
Provided theoretical insights into optimal coding strategies.
Abstract
In the Dictionary-based String Matching (DSM) problem, a retrieval system has access to a source sequence and stores the positions of a certain number of strings in a posting table. When a user queries the position of a string, the retrieval system, instead of searching the source sequence directly, relies on the posting table to answer the query more efficiently. In this paper, the Statistical DSM problem is proposed as a statistical and information-theoretic formulation of the classic DSM problem in which both the source and the query have a statistical description, while the strings stored in the posting sequence are described as a code. Through this formulation, we are able to define the efficiency of the retrieval system as the average cost of answering a user's query in the limit of a sufficiently long source sequence. This formulation is used to study the retrieval…
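To make the posting-table idea concrete, here is a minimal Python sketch of a k-gram dictionary. It is an illustration only and is not taken from the paper: the function names `build_posting_table` and `find_positions` are hypothetical, and the "cost" it reports (the number of candidate positions checked against the source) is an assumed stand-in, not the paper's cost definition.

```python
from collections import defaultdict


def build_posting_table(source: str, k: int) -> dict[str, list[int]]:
    """Map every k-gram of the source to the sorted list of its start positions."""
    table = defaultdict(list)
    for i in range(len(source) - k + 1):
        table[source[i:i + k]].append(i)
    return dict(table)


def find_positions(source: str, table: dict[str, list[int]], k: int, query: str):
    """Return (positions, cost) for a query.

    cost counts how many source positions were checked; this is an
    illustrative measure assumed here, not the paper's definition.
    """
    if len(query) < k:
        # Query is shorter than any stored string: fall back to a direct scan.
        candidates = list(range(len(source) - len(query) + 1))
    else:
        # Look up the query's first k-gram and verify only those positions.
        candidates = table.get(query[:k], [])
    positions = [i for i in candidates if source.startswith(query, i)]
    return positions, len(candidates)


source = "abracadabra"
table = build_posting_table(source, k=2)
print(find_positions(source, table, 2, "abra"))  # two candidates checked
print(find_positions(source, table, 2, "a"))     # falls back to a full scan
```

In this toy setting, a query at least as long as k is verified only at the positions listed for its first k-gram, while a shorter query forces a scan of the whole source. That gap between lookup and scan is the kind of trade-off an average-cost analysis would quantify.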
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Machine Learning and Algorithms
