ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain
Ali Shiraee Kasmaee, Mohammad Khodadad, Mohammad Arshi Saloot, Nicholas Sherck, Stephen Dokas, Hamidreza Mahyar, Soheila Samiee

TL;DR
ChemTEB introduces a specialized benchmark for evaluating chemical text embedding models, addressing domain-specific challenges and providing insights into model performance in chemical literature.
Contribution
The paper presents ChemTEB, a new benchmark tailored to chemical-domain embedding models, together with an evaluation of 34 models and open-source code and data.
Findings
Current models show varied strengths and weaknesses in chemical data processing.
ChemTEB facilitates standardized evaluation of chemical text embeddings.
Open-source code and data support further research and development.
Abstract
Recent advancements in language models have started a new era of superior information retrieval and content generation, with embedding models playing an important role in optimizing data representation efficiency and performance. While benchmarks like the Massive Text Embedding Benchmark (MTEB) have standardized the evaluation of general domain embedding models, a gap remains in specialized fields such as chemistry, which require tailored approaches due to domain-specific challenges. This paper introduces a novel benchmark, the Chemical Text Embedding Benchmark (ChemTEB), designed specifically for the chemical sciences. ChemTEB addresses the unique linguistic and semantic complexities of chemical literature and data, offering a comprehensive suite of tasks on chemical domain data. Through the evaluation of 34 open-source and proprietary models using this benchmark, we illuminate the…
Taxonomy
Topics: Advanced Text Analysis Techniques · Biomedical Text Mining and Ontologies
