SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts

Marc Brinner; Sina Zarriess

arXiv:2507.13105 · cs.CL · July 18, 2025

TL;DR

SemCSE is an unsupervised method that uses LLM-generated summaries to learn semantic embeddings of scientific abstracts, improving semantic understanding and separation in the embedding space.

Contribution

We propose SemCSE, a novel unsupervised approach that uses LLM-generated summaries to learn better semantic embeddings of scientific texts, validated on a newly proposed semantic benchmark and on SciRepEval.

Findings

01. Achieves state-of-the-art results on the SciRepEval benchmark.
02. Enforces stronger semantic separation in the embedding space.
03. Outperforms existing models of similar size.

Abstract

We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. The resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model's ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text…
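
To make the contrastive idea in the abstract concrete, here is a minimal sketch of a generic InfoNCE-style loss in plain Python. The abstract does not state the exact loss, so this is an illustration under assumptions and not the authors' implementation: it assumes that `anchors[i]` and `positives[i]` are embeddings of two semantically related summaries, that all other entries in the batch act as negatives, and that similarity is cosine similarity scaled by a temperature. The function names and the temperature value of 0.05 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper, not from the paper).

    anchors[i] is assumed to be semantically related to positives[i];
    every other positives[j] is treated as an in-batch negative.
    """
    assert anchors and len(anchors) == len(positives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Scaled similarities of this anchor to every candidate in the batch.
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the matching summary.
        total += log_denom - logits[i]
    return total / len(anchors)


# Toy 2-D "embeddings": matching pairs point in similar directions.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce_loss(anchors, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

The loss is small when each anchor is closest to its own partner and large when the partners are swapped, which is the sense in which training pulls related summaries together and pushes unrelated ones apart.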

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies

Methods: Contrastive Learning