Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval
Qiuhai Zeng, Zimeng Qiu, Dae Yon Hwang, Xin He, William M. Campbell

TL;DR
This paper introduces an unsupervised method for text representation learning using instruction-tuned large language models, significantly improving zero-shot dense retrieval performance across multiple datasets.
Contribution
It proposes a novel unsupervised approach that leverages instruction-tuning of LLMs to generate synthetic queries, enhancing corpus representations for zero-shot retrieval.
Findings
Significant improvement in zero-shot retrieval metrics across datasets.
Outperforms several dense retrievers while using a smaller model.
Effective use of synthetic queries for corpus representation enhancement.
Abstract
Dense retrieval systems are commonly used for information retrieval (IR). They rely on learning text representations through an encoder and usually require supervised modeling via labelled data, which can be costly to obtain or simply unavailable. In this study, we introduce a novel unsupervised text representation learning technique that instruction-tunes pre-trained encoder-decoder large language models (LLMs) under the dual-encoder retrieval framework. Building on the Rao-Blackwell theorem, we demonstrate that the corpus representation can be augmented by the representations of relevant synthetic queries generated by the instruction-tuned LLM. Furthermore, we effectively align the query and corpus text representations with self-instructed tuning. Specifically, we first prompt an open-box pre-trained LLM to follow defined instructions (i.e. question generation and keyword summarization) to…
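The corpus-augmentation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the equal-weight averaging, the cosine scoring, and the function names (`augment_passage`, `rank`) are assumptions made for illustration, and the toy vectors stand in for embeddings that would come from the encoder of the instruction-tuned LLM.

```python
from math import sqrt
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def augment_passage(passage_vec: Sequence[float],
                    synthetic_query_vecs: Sequence[Sequence[float]]) -> Vector:
    """Combine a passage embedding with embeddings of its synthetic queries.

    Assumption: a plain average with equal weights. The paper may weight
    the passage and the generated queries differently.
    """
    return mean_vector([passage_vec, *synthetic_query_vecs])


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: Sequence[float],
         passage_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return passage indices sorted by descending similarity to the query."""
    return sorted(
        range(len(passage_vecs)),
        key=lambda i: cosine(query_vec, passage_vecs[i]),
        reverse=True,
    )


# Toy 2-d embeddings (hypothetical values, for illustration only).
passage = [1.0, 0.0]
synthetic_queries = [[0.0, 1.0], [0.0, 1.0]]
augmented = augment_passage(passage, synthetic_queries)
order = rank([0.0, 1.0], [passage, augmented])
```

In this toy setup the augmented vector moves toward the synthetic queries, so a user query that resembles them ranks the augmented passage above the unaugmented one. That is the intuition behind enriching corpus representations with generated queries.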
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Natural Language Processing Techniques
Methods: Flan-T5 · ALIGN
