Separating Semantic Expansion from Linear Geometry for PubMed-Scale Vector Search

Rob Koopman

arXiv:2601.05268·cs.IR·January 12, 2026

Open Access

TL;DR

This paper introduces a scalable biomedical retrieval system that separates semantic interpretation, handled by a language model, from metric geometry, handled by a fixed embedding space. This enables PubMed-scale vector search with no trained parameters.

Contribution

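It presents a training-free framework in which a large language model expands the query into concise biomedical phrases, and fixed geometric transformations (projection away from nuisance axes and a Johnson-Lindenstrauss transform) define the space in which large-scale document retrieval runs.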

Findings

01. Retrieves coherent biomedical clusters across the full MEDLINE corpus (about 40 million records)

02. Uses exact cosine search on 256-dimensional int8 vectors

03. Evaluates results with purely geometric metrics (head cosine, compactness, centroid closure, and isotropy) compared against random-vector baselines

Abstract

We describe a PubMed-scale retrieval framework that separates semantic interpretation from metric geometry. A large language model expands a natural-language query into concise biomedical phrases; retrieval then operates in a fixed, mean-free, approximately isotropic embedding space. Each document and query vector is formed as a weighted mean of token embeddings, projected onto the complement of nuisance axes and compressed by a Johnson-Lindenstrauss transform. No parameters are trained. The system retrieves coherent biomedical clusters across the full MEDLINE corpus (about 40 million records) using exact cosine search on 256-dimensional int8 vectors. Evaluation is purely geometric: head cosine, compactness, centroid closure, and isotropy are compared with random-vector baselines. Recall is not defined, since the language-model expansion specifies the effective target set.
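The vector construction described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: every function name below is hypothetical, and several details are assumptions because the abstract does not specify them. In particular, it assumes the token weights, the corpus mean, and the orthonormal nuisance axes are supplied by the caller, that the Johnson-Lindenstrauss transform is a dense Gaussian random matrix, and that int8 compression is done by unit-normalizing and scaling to [-127, 127]. It uses only the Python standard library and toy dimensions rather than 256.

```python
import math
import random


def weighted_mean(token_vecs, weights):
    """Weighted mean of token embeddings (weights are caller-supplied)."""
    total = sum(weights)
    dim = len(token_vecs[0])
    return [sum(w * v[i] for v, w in zip(token_vecs, weights)) / total
            for i in range(dim)]


def remove_axes(vec, axes):
    """Project vec onto the complement of `axes` (assumed orthonormal)."""
    out = list(vec)
    for axis in axes:
        dot = sum(x * a for x, a in zip(out, axis))
        out = [x - dot * a for x, a in zip(out, axis)]
    return out


def jl_matrix(in_dim, out_dim, seed=0):
    """Assumed JL variant: Gaussian entries scaled by 1/sqrt(out_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_matrix(matrix, vec):
    """Matrix-vector product for a row-major list-of-lists matrix."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def quantize_int8(vec):
    """Assumed scheme: unit-normalize, then scale and round into [-127, 127]."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [max(-127, min(127, round(127 * x / norm))) for x in vec]


def embed(token_vecs, weights, mean, axes, proj):
    """Weighted mean -> subtract mean -> drop nuisance axes -> JL -> int8."""
    v = weighted_mean(token_vecs, weights)
    v = [x - m for x, m in zip(v, mean)]
    v = remove_axes(v, axes)
    return quantize_int8(apply_matrix(proj, v))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query, docs, top_k=3):
    """Exact search: score every document, return (score, index) pairs."""
    scored = [(cosine(query, d), i) for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)[:top_k]


if __name__ == "__main__":
    rng = random.Random(1)
    dim, out_dim = 16, 8                      # toy sizes, not the paper's
    proj = jl_matrix(dim, out_dim)
    mean = [0.0] * dim                        # placeholder corpus mean
    axes = [[1.0] + [0.0] * (dim - 1)]        # placeholder nuisance axis
    docs = []
    for _ in range(5):
        toks = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
        docs.append(embed(toks, [1.0] * 4, mean, axes, proj))
    print(search(docs[2], docs, top_k=2))
```

In this sketch the search is a brute-force scan over every stored vector, which is what "exact" means here; at MEDLINE scale the same scan would presumably rely on vectorized int8 arithmetic rather than Python loops.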

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Machine Learning in Healthcare · Topic Modeling