ChEmbed: Enhancing Chemical Literature Search Through Domain-Specific Text Embeddings
Ali Shiraee Kasmaee, Mohammad Khodadad, Mehdi Astaraki, Mohammad Arshi Saloot, Nicholas Sherck, Hamidreza Mahyar, Soheila Samiee

TL;DR
ChEmbed is a domain-specific text embedding model for chemical literature retrieval, trained on synthetic queries and chemical corpora, significantly improving retrieval accuracy over general models.
Contribution
We developed ChEmbed, a specialized embedding model for chemical literature with an augmented tokenizer and extended context length. It is trained on synthetic query-passage data and outperforms general models on a new benchmark.
Findings
ChEmbed achieves an nDCG@10 of 0.91, compared with 0.82 for general models.
Adding chemically specialized tokens reduces entity fragmentation.
Extended context length enables retrieval of longer passages.
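For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It is not the paper's evaluation code. The function names `dcg_at_k` and `ndcg_at_k` are illustrative, and the sketch assumes the input list holds graded relevance labels for every judged document, in the order the system ranked them.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    # Rank positions are 1-based, so the discount for index i is log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    Assumption: ranked_relevances covers all judged documents for the query,
    so sorting it in descending order yields the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical example: the single relevant passage is ranked second.
score = ndcg_at_k([0, 1, 0, 0])
```

A benchmark-level score such as the ones reported above is typically the mean of this per-query value over all queries.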
Abstract
Retrieval-Augmented Generation (RAG) systems in chemistry heavily depend on accurate and relevant retrieval of chemical literature. However, general-purpose text embedding models frequently fail to adequately represent complex chemical terminologies, resulting in suboptimal retrieval quality. Specialized embedding models tailored to chemical literature retrieval have not yet been developed, leaving a substantial performance gap. To address this challenge, we introduce ChEmbed, a domain-adapted family of text embedding models fine-tuned on a dataset comprising chemistry-specific text from the PubChem, Semantic Scholar, and ChemRxiv corpora. To create effective training data, we employ large language models to synthetically generate queries, resulting in approximately 1.7 million high-quality query-passage pairs. Additionally, we augment the tokenizer by adding 900 chemically specialized…
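The effect of adding domain tokens to a tokenizer can be illustrated with a toy example. The sketch below is not ChEmbed's tokenizer: it uses a simplified greedy longest-match segmentation (in the spirit of WordPiece, without continuation prefixes), and the vocabulary and the term `acetylsalicylic` are made up for illustration. It only shows why a chemical name that is absent from the vocabulary fragments into several pieces, and why adding it as a single token removes that fragmentation.

```python
def greedy_tokenize(word, vocab):
    """Segment a word by repeatedly taking the longest prefix found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical general-purpose vocabulary with only short subword pieces.
base_vocab = {"ace", "tyl", "sal", "icy", "lic"}
term = "acetylsalicylic"

before = greedy_tokenize(term, base_vocab)

# Extend the vocabulary with the whole chemical term as one token.
extended_vocab = base_vocab | {term}
after = greedy_tokenize(term, extended_vocab)

print(len(before), "pieces before;", len(after), "piece after")
```

In a real model, each added token also needs a new embedding row, which is then learned during fine-tuning.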
Taxonomy
Topics: Machine Learning in Materials Science · Biomedical Text Mining and Ontologies · Computational Drug Discovery Methods
