Mining Statistically Significant Substrings using the Chi-Square Statistic
Mayank Sachan, Arnab Bhattacharya

TL;DR
This paper introduces an efficient algorithm to identify the most statistically significant substring of a sequence under the chi-square statistic, reducing the running time from O(n^2) to O(n^{3/2}) with high probability, with applications in cryptology, finance, and sports.
Contribution
The paper presents a novel O(n^{3/2}) algorithm for finding the most significant substring based on chi-square, outperforming previous methods with O(n^2) complexity.
Findings
The algorithm runs in O(n^{3/2}) time with high probability
Effective in detecting significant patterns in various domains
Outperforms existing heuristics in experiments
Abstract
The problem of identification of statistically significant patterns in a sequence of data has been applied to many domains such as intrusion detection systems, financial models, web-click records, automated monitoring systems, computational biology, cryptology, and text analysis. An observed pattern of events is deemed to be statistically significant if it is unlikely to have occurred due to randomness or chance alone. We use the chi-square statistic as a quantitative measure of statistical significance. Given a string of characters generated from a memoryless Bernoulli model, the problem is to identify the substring for which the empirical distribution of single letters deviates the most from the distribution expected from the generative Bernoulli model. This deviation is captured using the chi-square measure. The most significant substring (MSS) of a string is thus defined as the…
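To make the problem statement concrete, the following is a minimal sketch of the quadratic brute-force baseline, not the paper's O(n^{3/2}) algorithm. It assumes the standard Pearson form of the chi-square statistic, X^2 = sum over characters c of (Y_c - l*p_c)^2 / (l*p_c), where Y_c is the observed count of c in a substring of length l and p_c is its probability under the generative model. The function name `most_significant_substring` and the `probs` dictionary are illustrative choices, not names from the paper.

```python
def most_significant_substring(s, probs):
    """Brute-force scan over all substrings (O(n^2 * k) for alphabet size k).

    s     -- input string; every character is assumed to be a key of probs
    probs -- dict mapping each character to its model probability (> 0)

    Returns (chi_square_value, start, end) for the best substring s[start:end].
    """
    alphabet = list(probs)

    # prefix[i][c] = number of occurrences of c in s[:i], so any substring's
    # character counts are available as a difference of two prefix rows.
    prefix = [dict.fromkeys(alphabet, 0)]
    for ch in s:
        row = dict(prefix[-1])
        row[ch] += 1
        prefix.append(row)

    best = (0.0, 0, 0)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            length = j - i
            # Pearson chi-square of the observed counts against length * p_c.
            x2 = sum(
                (prefix[j][c] - prefix[i][c] - length * probs[c]) ** 2
                / (length * probs[c])
                for c in alphabet
            )
            if x2 > best[0]:
                best = (x2, i, j)
    return best


if __name__ == "__main__":
    value, start, end = most_significant_substring(
        "ababaaaaab", {"a": 0.5, "b": 0.5}
    )
    print(value, start, end, "ababaaaaab"[start:end])
```

With a uniform two-letter model, the run of five consecutive `a` characters is the substring whose letter counts deviate most from the expected even split, so the scan selects it over longer but more balanced substrings.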
Taxonomy
Topics: Algorithms and Data Compression · Data Management and Algorithms · Data Mining Algorithms and Applications
