# Preference-driven Similarity Join

**Authors:** Chuancong Gao, Jiannan Wang, Jian Pei, Rui Li, Yi Chang

arXiv: 1706.04266 · 2017-07-13

## TL;DR

This paper introduces preference-driven similarity join, which avoids threshold tuning: the user chooses one of several result-set preferences, and the system automatically computes the join result that optimizes the chosen preference's objective.

## Contribution

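The paper proposes a preference-driven similarity join framework in which a chosen result-set preference, treated as an objective function, replaces a user-specified similarity threshold. As a proof of concept, it devises two preferences and couples the framework with optimization techniques.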

## Key findings

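- Achieves high-quality join results without a tedious threshold-tuning process.
- Is evaluated on four real-world web datasets from a diverse range of application scenarios.
- Removes the back-and-forth threshold tuning that the authors argue is likely to dominate the end-to-end time of threshold-driven similarity join.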

## Abstract

Similarity join, which can find similar objects (e.g., products, names, addresses) across different sources, is powerful in dealing with variety in big data, especially web data. Threshold-driven similarity join, which has been extensively studied in the past, assumes that a user is able to specify a similarity threshold, and then focuses on how to efficiently return the object pairs whose similarities pass the threshold. We argue that the assumption of a well-set similarity threshold may not be valid for two reasons. First, the optimal thresholds for different similarity join tasks may vary a lot. Second, the end-to-end time spent on similarity join is likely to be dominated by a back-and-forth threshold-tuning process.

In response, we propose preference-driven similarity join. The key idea is to provide several result-set preferences, rather than a range of thresholds, for a user to choose from. Intuitively, a result-set preference can be considered as an objective function that captures a user's preference on a similarity join result. Once a preference is chosen, we automatically compute the similarity join result that optimizes the preference objective. As a proof of concept, we devise two useful preferences and propose a novel preference-driven similarity join framework coupled with effective optimization techniques. Our approaches are evaluated on four real-world web datasets from a diverse range of application scenarios. The experiments show that preference-driven similarity join can achieve high-quality results without a tedious threshold-tuning process.
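
To make the contrast between the two styles of join concrete, here is a minimal Python sketch. It rests on several assumptions that do not come from the paper: the word-level Jaccard similarity, the toy product strings, the `one_to_one_pairs` objective (a hypothetical stand-in, not one of the paper's two preferences), and the brute-force search over candidate thresholds (not the paper's optimization techniques). It only illustrates the idea of choosing a join result by maximizing an objective over result sets instead of asking the user for a threshold.

```python
from itertools import product


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word tokens (an assumed similarity function)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def threshold_join(left, right, threshold):
    """Threshold-driven join: every cross-source pair whose similarity passes the threshold."""
    return [(l, r) for l, r in product(left, right) if jaccard(l, r) >= threshold]


def one_to_one_pairs(result):
    """Hypothetical preference objective: the number of result pairs whose
    two objects each appear in exactly one pair of the result."""
    left_count, right_count = {}, {}
    for l, r in result:
        left_count[l] = left_count.get(l, 0) + 1
        right_count[r] = right_count.get(r, 0) + 1
    return sum(1 for l, r in result if left_count[l] == 1 and right_count[r] == 1)


def preference_join(left, right, objective):
    """Brute-force preference-driven join: try every observed non-zero similarity
    value as a threshold and keep the result with the highest objective score."""
    candidates = sorted(
        {jaccard(l, r) for l, r in product(left, right)} - {0.0}, reverse=True
    )
    best_threshold, best_result, best_score = None, [], float("-inf")
    for t in candidates:
        result = threshold_join(left, right, t)
        score = objective(result)
        if score > best_score:  # ties keep the stricter (higher) threshold
            best_threshold, best_result, best_score = t, result, score
    return best_threshold, best_result


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    left = ["apple iphone 7 32gb", "samsung galaxy s8", "google pixel xl"]
    right = ["iphone 7 32gb apple", "galaxy s8 samsung phone",
             "pixel xl by google", "apple watch"]
    threshold, pairs = preference_join(left, right, one_to_one_pairs)
    print(threshold, pairs)
```

On this toy input the search settles on the three intuitive matches without the caller ever naming a threshold. Lowering the threshold further would admit the spurious pair with "apple watch" and reduce the objective. Enumerating every candidate threshold is quadratic in the input and is only meant to convey the idea.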

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1706.04266/full.md

## Figures

34 figures with captions in the complete paper: https://tomesphere.com/paper/1706.04266/full.md

## References

33 references, with the full list in the complete paper: https://tomesphere.com/paper/1706.04266/full.md

---
Source: https://tomesphere.com/paper/1706.04266