Separating Semantic Expansion from Linear Geometry for PubMed-Scale Vector Search
Rob Koopman

TL;DR
This paper introduces a scalable biomedical retrieval system that separates semantic query interpretation from the geometry of the embedding space, enabling PubMed-scale vector search without any trained parameters.
Contribution
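It presents a framework in which a large language model expands the query into concise biomedical phrases, while retrieval runs in a fixed embedding space built by projecting out nuisance axes and applying a Johnson-Lindenstrauss transform.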
Findings
Retrieves coherent biomedical clusters across the full MEDLINE corpus (about 40 million records)
Uses exact cosine search on 256-dimensional int8 vectors
Evaluation is purely geometric: head cosine, compactness, centroid closure, and isotropy are compared with random-vector baselines
Abstract
We describe a PubMed-scale retrieval framework that separates semantic interpretation from metric geometry. A large language model expands a natural-language query into concise biomedical phrases; retrieval then operates in a fixed, mean-free, approximately isotropic embedding space. Each document and query vector is formed as a weighted mean of token embeddings, projected onto the complement of nuisance axes and compressed by a Johnson-Lindenstrauss transform. No parameters are trained. The system retrieves coherent biomedical clusters across the full MEDLINE corpus (about 40 million records) using exact cosine search on 256-dimensional int8 vectors. Evaluation is purely geometric: head cosine, compactness, centroid closure, and isotropy are compared with random-vector baselines. Recall is not defined, since the language-model expansion specifies the effective target set.
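The geometric half of the pipeline described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the author's implementation: the token table, token weights, nuisance axes, and the Gaussian form of the Johnson-Lindenstrauss matrix are all random stand-ins, the sizes are far smaller than the paper's 256 dimensions and 40 million records, and every name (`embed`, `quantize_int8`, `search`) is hypothetical. The language-model query expansion step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions); the paper uses 256 output dimensions.
VOCAB, D_IN, D_OUT, N_NUISANCE = 1000, 64, 16, 2

# Hypothetical fixed token embeddings, mean-centered so the space is mean-free.
token_emb = rng.standard_normal((VOCAB, D_IN))
token_emb -= token_emb.mean(axis=0)
# Hypothetical per-token weights (the weighting scheme is not given in the abstract).
token_weight = rng.uniform(0.1, 1.0, size=VOCAB)

# Assumed nuisance axes: an orthonormal basis U of shape (D_IN, N_NUISANCE).
U, _ = np.linalg.qr(rng.standard_normal((D_IN, N_NUISANCE)))

# Gaussian random projection, one common Johnson-Lindenstrauss construction.
R = rng.standard_normal((D_IN, D_OUT)) / np.sqrt(D_OUT)

def embed(token_ids):
    """Weighted mean of token vectors, nuisance axes removed, then JL-compressed."""
    ids = np.asarray(token_ids)
    w = token_weight[ids]
    v = (w[:, None] * token_emb[ids]).sum(axis=0) / w.sum()
    v = v - U @ (U.T @ v)          # project onto the complement of span(U)
    return v @ R                   # compress D_IN -> D_OUT

def quantize_int8(v):
    """Per-vector scaling to int8; cosine is invariant to the positive scale."""
    scaled = v * (127.0 / np.max(np.abs(v)))
    return np.round(scaled).astype(np.int8)

def search(query_q, docs_q, k=3):
    """Exact (brute-force) cosine search over int8 vectors."""
    dots = docs_q.astype(np.int32) @ query_q.astype(np.int32)
    norms = (np.linalg.norm(docs_q.astype(np.float32), axis=1)
             * np.linalg.norm(query_q.astype(np.float32)))
    cos = dots / norms
    top = np.argsort(-cos)[:k]
    return top, cos[top]

# Usage: 50 synthetic "documents" of 20 tokens each; query with document 5's tokens.
doc_tokens = rng.integers(0, VOCAB, size=(50, 20))
docs_q = np.stack([quantize_int8(embed(t)) for t in doc_tokens])
query_q = quantize_int8(embed(doc_tokens[5]))
top, scores = search(query_q, docs_q)
```

Because documents and queries pass through the same fixed transform, no parameters are fit at any point; in this sketch the only per-corpus statistic is the mean subtracted from the token table.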
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Machine Learning in Healthcare · Topic Modeling
