Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining
Mahmoud Amiri, Jamile Mohammad Jafari, Sara Mostafapour, Thomas Bocklitz

TL;DR
Lit2Vec is a reproducible workflow for building a license-screened, structured chemistry research corpus from S2ORC, supporting retrieval and text mining with validated metadata and paragraph-level embeddings.
Contribution
The paper introduces a reproducible, license-aware workflow for constructing and validating a large chemistry corpus with structured text, embeddings, and annotations, together with accompanying resources.
Findings
Assembled a corpus of 582,683 chemistry articles with structured full text.
Generated paragraph-level embeddings using the intfloat/e5-large-v2 model.
Validated the corpus for schema compliance, reproducibility, and text quality.
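Paragraph-level embedding presupposes that each paragraph fits within the encoder's input window. The following is a minimal sketch of token-aware chunking, not the Lit2Vec implementation. It rests on several assumptions: whitespace splitting stands in for the model tokenizer, the 512-token default mirrors the truncation limit given on the intfloat/e5-large-v2 model card, and `chunk_paragraph` and `to_passage_inputs` are hypothetical names. The `passage: ` prefix follows the usage notes on that model card; this summary does not say whether Lit2Vec applies it.

```python
from typing import Callable, List


def whitespace_tokens(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): a real pipeline would count model tokens."""
    return text.split()


def chunk_paragraph(
    text: str,
    max_tokens: int = 512,
    tokenize: Callable[[str], List[str]] = whitespace_tokens,
) -> List[str]:
    """Split one paragraph into consecutive chunks of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenize(text)
    # Joining with spaces is only valid for the whitespace stand-in;
    # a subword tokenizer would need its own decode step.
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def to_passage_inputs(chunks: List[str]) -> List[str]:
    """Prepend the 'passage: ' prefix that E5 models expect for documents."""
    return [f"passage: {chunk}" for chunk in chunks]


# Illustrative text only, not taken from the corpus.
paragraph = "Suzuki coupling of aryl halides proceeds under mild conditions " * 3
chunks = chunk_paragraph(paragraph, max_tokens=10)
inputs = to_passage_inputs(chunks)
```

In a real pipeline the prefix itself consumes a few tokens, so the chunk budget would typically be set slightly below the encoder limit.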
Abstract
We present Lit2Vec, a reproducible workflow for constructing and validating a chemistry corpus from the Semantic Scholar Open Research Corpus using conservative, metadata-based license screening. Using this workflow, we assembled an internal study corpus of 582,683 chemistry-specific full-text research articles with structured full text, token-aware paragraph chunks, paragraph-level embeddings generated with the intfloat/e5-large-v2 model, and record-level metadata including abstracts and licensing information. To support downstream retrieval and text-mining use cases, an eligible subset of the corpus was additionally enriched with machine-generated brief summaries and multi-label subfield annotations spanning 18 chemistry domains. Licensing was screened using metadata from Unpaywall, OpenAlex, and Crossref, and the resulting corpus was technically validated for schema compliance,…
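One way to read "conservative, metadata-based license screening" is sketched below. It is an illustration under stated assumptions, not the rule the authors used: the allow-list `PERMISSIVE`, the normalized license strings, and the function `passes_screen` are all hypothetical. The sketch accepts a record only when at least one of the three sources reports a license and every reported license is on the allow-list, so missing or conflicting metadata leads to exclusion.

```python
from typing import Dict, Optional

# Hypothetical allow-list of normalized license identifiers (assumption).
PERMISSIVE = {"cc-by", "cc0", "public-domain"}


def normalize(license_str: Optional[str]) -> Optional[str]:
    """Lowercase and trim a license string; empty or missing values become None."""
    if license_str is None:
        return None
    cleaned = license_str.strip().lower()
    return cleaned or None


def passes_screen(licenses: Dict[str, Optional[str]]) -> bool:
    """Accept a record only if at least one source reports a license and
    every reported license is on the allow-list (conservative rule)."""
    reported = [lic for lic in map(normalize, licenses.values()) if lic is not None]
    return bool(reported) and all(lic in PERMISSIVE for lic in reported)


# Illustrative records keyed by metadata source; values are made up.
record = {"unpaywall": "CC-BY", "openalex": "cc-by", "crossref": None}
conflict = {"unpaywall": "cc-by", "openalex": "cc-by-nc", "crossref": None}
unknown = {"unpaywall": None, "openalex": None, "crossref": ""}
```

Here `passes_screen(record)` accepts the article, while `conflict` and `unknown` are rejected because one source disagrees or no source reports a license at all.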
