Similarity Driven Approximation for Text Analytics
Guangyan Hu, Yongfeng Zhang, Sandro Rigo, Thu D. Nguyen

TL;DR
This paper introduces EmApprox, a framework that accelerates large-scale text analytics by using similarity-guided sampling based on learned vector representations, enabling faster query processing with controlled accuracy loss.
Contribution
EmApprox is a novel approximation framework that leverages NLP-based vector models to efficiently sample data subsets for diverse text queries, reducing computation time.
Findings
EmApprox achieves up to 10x speedup in query processing.
Similarity-guided sampling outperforms random sampling in accuracy.
Small sampling fractions still maintain acceptable error levels.
Abstract
Text analytics has become an important part of business intelligence as enterprises increasingly seek to extract insights for decision making from text data sets. Processing large text data sets can be computationally expensive, however, especially if it involves sophisticated algorithms. This challenge is exacerbated when it is desirable to run different types of queries against a data set, making it expensive to build multiple indices to speed up query processing. In this paper, we propose and evaluate a framework called EmApprox that uses approximation to speed up the processing of a wide range of queries over large text data sets. The key insight is that different types of queries can be approximated by processing subsets of data that are most similar to the queries. EmApprox builds a general index for a data set by learning a natural language processing model, producing a set of…
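The key insight in the abstract, processing the subsets of data most similar to a query, can be illustrated with a small sketch. This is not EmApprox's actual implementation: the function names (`cosine`, `sampling_probabilities`, `estimate_total`), the `floor` parameter, and the choice of a Hansen-Hurwitz estimator over with-replacement sampling are all assumptions made for illustration. The sketch assumes each document already has a learned vector, turns query-document cosine similarity into sampling probabilities, and reweights the sampled values so that an aggregate total stays unbiased.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sampling_probabilities(doc_vectors, query_vector, floor=0.05):
    """Map similarity to a probability distribution over documents.

    Cosine similarity is shifted from [-1, 1] to [0, 1]; `floor` (a hypothetical
    knob) keeps every probability strictly positive so the estimator stays unbiased.
    """
    weights = [(cosine(d, query_vector) + 1.0) / 2.0 + floor for d in doc_vectors]
    total = sum(weights)
    return [w / total for w in weights]


def estimate_total(values, probs, sample_size, rng):
    """Hansen-Hurwitz estimate of sum(values) from a weighted sample.

    Documents are drawn with replacement according to `probs`; each sampled
    value is divided by its selection probability before averaging.
    """
    indices = rng.choices(range(len(values)), weights=probs, k=sample_size)
    return sum(values[i] / probs[i] for i in indices) / sample_size


if __name__ == "__main__":
    # Toy data: 2-d "embeddings" and a per-document quantity to aggregate
    # (for example, the number of query-term matches in each document).
    doc_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    matches = [12, 9, 1, 0]
    query_vector = [1.0, 0.0]

    probs = sampling_probabilities(doc_vectors, query_vector)
    approx = estimate_total(matches, probs, sample_size=3, rng=random.Random(0))
    print("exact total:", sum(matches), "approximate total:", approx)
```

When documents that contribute most to the answer are also the most similar to the query, their values divided by their probabilities vary little, so the estimate has lower variance than uniform random sampling at the same sample size.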
Taxonomy
Topics: Data Management and Algorithms · Recommender Systems and Techniques · Data Stream Mining Techniques
