Zero-Shot Contextual Embeddings via Offline Synthetic Corpus Generation
Philip Lippmann, Jie Yang

TL;DR
ZEST is a zero-shot framework that creates a synthetic corpus from a few examples to enable context-aware embeddings without target corpus access, achieving near-equivalent performance to models with full data.
Contribution
It introduces a novel offline synthetic corpus generation method for zero-shot domain adaptation of contextual embeddings, eliminating the need for target data access or fine-tuning.
Findings
Achieves within 0.5% of target-corpus models on the MTEB benchmark
Requires only five exemplar documents for effective adaptation
Operates without any retraining or access to the target corpus
Abstract
Context-aware embedding methods boost retrieval accuracy by conditioning on corpus statistics (e.g., term co-occurrence and topical patterns) extracted from neighboring documents. However, this context-aware approach requires either access to the target corpus or domain-specific finetuning, posing practical barriers in privacy-sensitive or resource-constrained settings. We present ZEST, a zero-shot contextual adaptation framework that replaces real corpus access with a one-time offline synthesis of a compact proxy. Given only a handful of exemplar documents representative of the general target domain, we use a multi-step hierarchical procedure to generate a synthetic context corpus of several hundred documents that aims to emulate key domain-specific distributions. At inference, the frozen context-aware encoder uses this proxy corpus -- without any finetuning or target corpus access --…
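To make the idea of conditioning on a proxy corpus concrete, here is a minimal sketch. It is not the ZEST pipeline or its encoder: smoothed TF-IDF weighting is swapped in as a simple stand-in for a context-aware encoder, and the three-sentence `proxy_corpus` is hand-written for illustration rather than synthesized from exemplars. All function names (`idf_from_corpus`, `embed`, `cosine`) are hypothetical. The only point it shows is that the corpus statistics used at embedding time can come from a small stand-in corpus instead of the target corpus.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption made for brevity)."""
    return text.lower().split()


def idf_from_corpus(corpus):
    """Compute smoothed IDF weights from a context corpus.

    Returns (idf_by_term, default_idf); default_idf is used for terms
    that never occur in the context corpus.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }
    default_idf = math.log(1 + n_docs) + 1.0
    return idf, default_idf


def embed(text, idf, default_idf):
    """Sparse, L2-normalised TF-IDF vector conditioned on the context corpus."""
    term_freq = Counter(tokenize(text))
    vec = {t: c * idf.get(t, default_idf) for t, c in term_freq.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


# Hypothetical proxy corpus standing in for the target domain.
proxy_corpus = [
    "the patient was given a dose of aspirin",
    "the patient reported chest pain after the dose",
    "the trial measured the effect of the drug on the patient",
]

# Statistics are computed once, offline, from the proxy corpus only.
idf, default_idf = idf_from_corpus(proxy_corpus)

query = embed("the aspirin dose", idf, default_idf)
doc_a = embed("aspirin for the patient", idf, default_idf)
doc_b = embed("the patient and the trial", idf, default_idf)

score_a = cosine(query, doc_a)
score_b = cosine(query, doc_b)
```

Because "the" and "patient" appear in every proxy document, they receive low weight, so `doc_a` (which shares the rarer term "aspirin" with the query) scores higher than `doc_b`. The scoring functions never see the target corpus, which mirrors the zero-shot constraint described in the abstract.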
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks
