SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts
Marc Brinner, Sina Zarriess

TL;DR
SemCSE is an unsupervised method that uses LLM-generated summaries to learn semantic embeddings of scientific abstracts, improving semantic understanding and separation in the embedding space.
Contribution
We propose SemCSE, an unsupervised approach that uses LLM-generated summaries to learn semantic embeddings of scientific texts, and validate it with a new benchmark for semantic understanding.
Findings
Achieves state-of-the-art results on the SciRepEval benchmark.
Enforces stronger semantic separation in embeddings.
Outperforms existing models of similar size.
Abstract
We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. The resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model's ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text…
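The idea of pulling related summaries together in the embedding space can be illustrated with a generic contrastive loss. The listing does not state the exact objective SemCSE uses, so the following is only a minimal sketch under assumptions: an InfoNCE-style loss with in-batch negatives, cosine similarity, and a temperature of 0.05. The function names and the toy 2-D vectors are hypothetical and stand in for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE-style loss (assumed objective, not confirmed by the source).

    positives[i] is treated as the match for anchors[i]; every other
    entry of `positives` in the batch acts as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy embeddings: summaries_a[i] and summaries_b[i] stand in for two
# summaries of the same abstract (hypothetical values).
summaries_a = [[1.0, 0.0], [0.0, 1.0]]
summaries_b = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce_loss(summaries_a, summaries_b)
shuffled_loss = info_nce_loss(summaries_a, summaries_b[::-1])
```

With correctly paired summaries the loss is close to zero, while pairing each summary with one from a different abstract yields a much larger loss. Minimizing such a loss is what moves related summaries closer together and unrelated ones apart.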
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies
Methods: Contrastive Learning
