SAUCE: Truncated Sparse Document Signature Bit-Vectors for Fast Web-Scale Corpus Expansion
Muntasir Wahed, Daniel Gruhl, Alfredo Alba, Anna Lisa Gentile, Petar Ristoski, Chad Deluca, Steve Welch, Ismini Lourentzou

TL;DR
SAUCE introduces a novel sparse document signature method that enables fast, scalable web-scale corpus expansion from limited seed data, effectively capturing domain-specific terms with reduced computational costs.
Contribution
The paper proposes SAUCE, a truncated sparse bit-vector representation for efficient corpus expansion, addressing computational challenges and long-tail term coverage in domain-specific text retrieval.
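To make the idea of a truncated sparse bit-vector signature concrete, here is a minimal Python sketch. It is not the paper's algorithm: this summary does not describe how SAUCE selects or encodes terms, so everything below is an assumption made for illustration. That includes hashing terms to bit positions, truncating by keeping the `k` rarest terms by inverse document frequency, scoring by bit overlap, and the hypothetical names `NUM_BITS`, `MAX_BITS_SET`, `signature`, and `expand`.

```python
import hashlib
import math
from collections import Counter

NUM_BITS = 1 << 16   # hypothetical signature width (assumption, not from the paper)
MAX_BITS_SET = 8     # hypothetical truncation limit: max bits kept per document


def term_bit(term: str) -> int:
    """Map a term to a stable bit position via a hash (assumed encoding)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BITS


def document_frequencies(docs):
    """Count in how many documents each whitespace-delimited term appears."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def signature(doc, df, num_docs, k=MAX_BITS_SET):
    """Sparse signature: the set of bit positions for the k rarest terms."""
    terms = set(doc.lower().split())

    def idf(term):
        return math.log((1 + num_docs) / (1 + df.get(term, 0)))

    # Rarest terms first; ties broken alphabetically so output is deterministic.
    ranked = sorted(terms, key=lambda t: (-idf(t), t))
    return frozenset(term_bit(t) for t in ranked[:k])


def overlap(sig_a, sig_b):
    """Number of shared set bits; a cheap stand-in for pairwise similarity."""
    return len(sig_a & sig_b)


def expand(seed_docs, candidates, k=MAX_BITS_SET, top_n=2):
    """Rank candidates by bit overlap with the union of the seed signatures."""
    all_docs = seed_docs + candidates
    df = document_frequencies(all_docs)
    n = len(all_docs)
    seed_sig = frozenset().union(*(signature(d, df, n, k) for d in seed_docs))
    scored = [(overlap(seed_sig, signature(c, df, n, k)), c) for c in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]
```

Storing only the indices of set bits keeps each signature small, and comparing two documents reduces to a set intersection rather than a dense vector product. That is the kind of saving the contribution statement points to, though the actual SAUCE construction may differ substantially from this sketch.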
Findings
SAUCE significantly reduces computational time compared to traditional methods.
SAUCE achieves high lexical coverage of domain-specific terms.
Experimental results validate SAUCE's effectiveness in large-scale corpus expansion.
Abstract
Recent advances in text representation have shown that training on large amounts of text is crucial for natural language understanding. However, models trained without predefined notions of topical interest typically require careful fine-tuning when transferred to specialized domains. When sufficient within-domain text is not available, expanding a seed corpus of relevant documents from large-scale web data poses several challenges. First, corpus expansion requires scoring and ranking each document in the collection, an operation that can quickly become computationally expensive as the web corpus grows. Relying on dense vector spaces and pairwise similarity adds to the computational expense. Second, as the domain concept becomes more nuanced, capturing the long tail of domain-specific rare terms becomes non-trivial, especially under limited seed corpora…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
