# Near-Duplicate Text Alignment under Weighted Jaccard Similarity

**Authors:** Yuheng Zhang, Miao Qiao, Zhencan Peng, Dong Deng

arXiv: 2509.00627 · 2025-09-03

## TL;DR

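This paper introduces MONO, a near-duplicate text alignment approach that supports weighted Jaccard similarity through consistent weighted sampling. MONO offers accuracy guarantees, is optimal within the hash-based framework, and outperforms the state of the art by up to 26x in index construction time, with up to 30% smaller indexes and up to 3x better query latency.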

## Contribution

MONO is the first method to support weighted Jaccard similarity with optimality guarantees using consistent weighted sampling in a hash-based framework.
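To make the idea of consistent weighted sampling concrete, here is a minimal Python sketch. It assumes Ioffe's improved consistent weighted sampling (ICWS) scheme, which this summary does not confirm as the variant MONO uses; the function names `weighted_jaccard`, `icws_sample`, and `estimate_similarity` are hypothetical and chosen for illustration only. The sketch shows the property the hash-based framework relies on: two weighted token sets draw the same sample with probability equal to their weighted Jaccard similarity.

```python
import math
import random


def weighted_jaccard(a, b):
    """Exact weighted Jaccard: sum of min weights over sum of max weights."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


def icws_sample(weights, hash_index):
    """Draw one consistent weighted sample (token, t) from a weight map.

    Assumption: this follows Ioffe's ICWS, not necessarily MONO's scheme.
    """
    best = None
    for token, w in weights.items():
        if w <= 0:
            continue
        # The random draws depend only on (hash_index, token), never on the
        # weight, so two weight maps see identical draws for a shared token.
        rng = random.Random(f"{hash_index}:{token}")
        r = rng.gammavariate(2.0, 1.0)
        c = rng.gammavariate(2.0, 1.0)
        beta = rng.random()
        t = math.floor(math.log(w) / r + beta)
        y = math.exp(r * (t - beta))
        a = c / (y * math.exp(r))
        if best is None or a < best[0]:
            best = (a, token, t)
    return None if best is None else (best[1], best[2])


def estimate_similarity(a, b, num_hashes=256):
    """Fraction of hash indices on which both weight maps draw the same sample."""
    hits = 0
    for i in range(num_hashes):
        sample_a = icws_sample(a, i)
        if sample_a is not None and sample_a == icws_sample(b, i):
            hits += 1
    return hits / num_hashes


if __name__ == "__main__":
    a = {"near": 2.0, "duplicate": 1.0, "text": 3.0}
    b = {"near": 1.0, "duplicate": 1.0, "alignment": 2.0}
    print("exact:", weighted_jaccard(a, b))
    print("estimate:", estimate_similarity(a, b, num_hashes=512))
```

The estimate is a noisy approximation of the exact value, and its variance shrinks as `num_hashes` grows. Seeding each draw by the pair (hash index, token) is what makes the sampling consistent across different weight maps.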

## Key findings

- Index construction is up to 26x faster than the state of the art.
- Index size shrinks by up to 30%.
- Query latency improves by up to 3x.

## Abstract

Near-duplicate text alignment is the task of identifying, among the texts in a corpus, all the subsequences (substrings) that are similar to a given query. Traditional approaches rely on seeding-extension-filtering heuristics, which lack accuracy guarantees and require many hard-to-tune parameters. Recent methods leverage min-hash techniques under a hash-based framework: group subsequences by their min-hash, and for any query, find all sketches similar to the query's sketch. These methods are efficient and guarantee that every subsequence whose estimated unweighted Jaccard similarity with the query exceeds a user-provided threshold is reported. However, they fail to account for token importance or frequency, which limits their use in real scenarios where tokens carry weights, such as TF-IDF. To address this, we propose MONO, an approach that supports weighted Jaccard similarity using consistent weighted sampling. MONO achieves optimality within the hash-based framework. For example, when token weights are proportional to frequencies, MONO generates O(n + n log f) groups in expectation for a text of length n, where f is the maximum token frequency. Each group takes O(1) space and represents a few subsequences sharing the same sampling. We prove this bound is tight: any algorithm must produce Omega(n + n log f) groups in expectation in the worst case. Experiments show that MONO outperforms the state of the art by up to 26x in index construction time, reduces index size by up to 30%, and improves query latency by up to 3x, while scaling well.
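As a point of reference for the task itself, the following Python sketch enumerates every substring of a tokenized text and scores it against the query with exact weighted Jaccard similarity, using token frequencies as weights. This is a brute-force illustration, not MONO: the names `weighted_jaccard` and `brute_force_align` are hypothetical, and the "at least the threshold" comparison is an assumption made for the example.

```python
from collections import Counter


def weighted_jaccard(x, y):
    """Exact weighted Jaccard between two token-to-weight maps."""
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    den = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    return num / den if den > 0 else 0.0


def brute_force_align(text_tokens, query_tokens, threshold):
    """Return (start, end, similarity) for each substring text_tokens[start:end]
    whose frequency-weighted Jaccard similarity with the query is at least
    `threshold` (the comparison operator is an assumption of this sketch).
    """
    query = Counter(query_tokens)
    results = []
    n = len(text_tokens)
    for start in range(n):
        window = Counter()
        for end in range(start, n):
            # Grow the window one token at a time; weights are frequencies.
            window[text_tokens[end]] += 1
            sim = weighted_jaccard(window, query)
            if sim >= threshold:
                results.append((start, end + 1, sim))
    return results


if __name__ == "__main__":
    text = "a b c a b".split()
    query = "a b".split()
    for start, end, sim in brute_force_align(text, query, 0.6):
        print(start, end, text[start:end], round(sim, 3))
```

A text of length n has on the order of n^2 substrings, so this enumeration scores each one separately. The hash-based framework described in the abstract avoids that cost by grouping subsequences that share the same sample.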

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00627/full.md

## Figures

79 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00627/full.md

## References

54 references — full list in the complete paper: https://tomesphere.com/paper/2509.00627/full.md

---
Source: https://tomesphere.com/paper/2509.00627