CASPER: Concept-integrated Sparse Representation for Scientific Retrieval
Lam Thanh Do, Linh Van Nguyen, Jiayu Li, David Fu, Kevin Chen-Chuan Chang

TL;DR
CASPER is a sparse retrieval model for scientific search that represents queries and documents with both tokens and keyphrases, and it outperforms strong dense and sparse baselines on eight scientific retrieval benchmarks.
Contribution
Introduces CASPER, a concept-aware sparse retrieval model using tokens and keyphrases, with a new training data construction method leveraging scholarly references.
Findings
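Outperforms strong dense and sparse retrieval baselines across eight scientific retrieval benchmarks.
Pruning the representations gives an effective trade-off between retrieval performance and efficiency.
Demonstrates interpretability: the model can also act as a keyphrase generation model.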
Abstract
Identifying relevant research concepts is crucial for effective scientific search. However, primary sparse retrieval methods often lack concept-aware representations. To address this, we propose CASPER, a sparse retrieval model for scientific search that utilizes both tokens and keyphrases as representation units (i.e., dimensions in the sparse embedding space). This enables CASPER to represent queries and documents via research concepts and match them at both granular and conceptual levels. Furthermore, we construct training data by leveraging abundant scholarly references (including titles, citation contexts, author-assigned keyphrases, and co-citations), which capture how research concepts are expressed in diverse settings. Empirically, CASPER outperforms strong dense and sparse retrieval baselines across eight scientific retrieval benchmarks. We also explore the…
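A minimal sketch of what matching over a mixed token and keyphrase vocabulary can look like, assuming sparse vectors stored as dictionaries and a plain dot product as the relevance score. The `tok:` and `kp:` prefixes, the weights, and the helper names `score` and `prune_top_k` are hypothetical illustrations, not the authors' implementation; top-k pruning is shown only as one common way to shrink a sparse vector and may differ from the pruning studied in the paper.

```python
from typing import Dict

# A sparse vector maps a dimension name to a non-negative weight.
# Dimensions are either tokens ("tok:...") or keyphrases ("kp:...").
# This naming scheme is an assumption made for illustration only.
SparseVec = Dict[str, float]


def score(query: SparseVec, doc: SparseVec) -> float:
    """Sparse dot product: sum weight products over shared dimensions."""
    # Iterate over the smaller vector to keep the lookup count low.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(w * doc[dim] for dim, w in query.items() if dim in doc)


def prune_top_k(vec: SparseVec, k: int) -> SparseVec:
    """Keep only the k highest-weighted dimensions (generic top-k pruning)."""
    top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)


# Hypothetical weights; a trained model would produce these.
query = {
    "tok:sparse": 1.0,
    "tok:retrieval": 0.5,
    "kp:sparse retrieval": 2.0,
}
doc = {
    "tok:sparse": 0.5,
    "tok:search": 1.0,
    "kp:sparse retrieval": 1.5,
    "kp:keyphrase generation": 0.8,
}

full_score = score(query, doc)                    # token match + keyphrase match
pruned_score = score(query, prune_top_k(doc, 2))  # only the 2 heaviest dims kept
print(full_score, pruned_score)
```

In this toy example the token dimension `tok:sparse` and the keyphrase dimension `kp:sparse retrieval` both contribute to the full score, while pruning the document to two dimensions drops the token match and keeps the keyphrase match, which illustrates how pruning trades some score contribution for a smaller representation.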
Taxonomy
Topics: Topic Modeling · Image Retrieval and Classification Techniques · Web Data Mining and Analysis
