Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval

Qiuhai Zeng; Zimeng Qiu; Dae Yon Hwang; Xin He; William M. Campbell

arXiv:2409.16497 · cs.AI · September 26, 2024

TL;DR

This paper introduces an unsupervised method for text representation learning using instruction-tuned large language models, significantly improving zero-shot dense retrieval performance across multiple datasets.

Contribution

It proposes a novel unsupervised approach that leverages instruction-tuning of LLMs to generate synthetic queries, enhancing corpus representations for zero-shot retrieval.

Findings

01 Significant improvement in zero-shot retrieval metrics across datasets.

02 Outperforms several dense retrievers while using a smaller model.

03 Effective use of synthetic queries for corpus representation enhancement.

Abstract

Dense retrieval systems are commonly used for information retrieval (IR). They rely on learning text representations through an encoder and usually require supervised modeling via labelled data, which can be costly to obtain or simply unavailable. In this study, we introduce a novel unsupervised text representation learning technique that instruction-tunes pre-trained encoder-decoder large language models (LLMs) under the dual-encoder retrieval framework. Building on the Rao-Blackwell theorem, we demonstrate that the corpus representation can be augmented by the representations of relevant synthetic queries generated by the instruction-tuned LLM. Furthermore, we effectively align the query and corpus text representations with self-instructed tuning. Specifically, we first prompt an open-box pre-trained LLM to follow defined instructions (i.e., question generation and keyword summarization) to…
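The augmentation idea in the abstract, combining a passage's representation with the representations of synthetic queries generated for it, can be illustrated with a small sketch. This is not the authors' implementation. The hashed bag-of-words `toy_embed` encoder, the `doc_weight` mixing parameter, and all function names are assumptions introduced here for illustration; in the paper the embeddings would come from the instruction-tuned LLM encoder, and the exact combination rule is not given in the excerpt above.

```python
import math
import zlib

DIM = 64  # toy embedding size (assumption, not from the paper)


def toy_embed(text, dim=DIM):
    """Hypothetical stand-in for an encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def augment_doc_embedding(doc_text, synthetic_queries, doc_weight=0.5):
    """Mix the passage embedding with the mean synthetic-query embedding.

    doc_weight is an illustrative knob (assumption); the result is re-normalized.
    """
    doc_vec = toy_embed(doc_text)
    if not synthetic_queries:
        return doc_vec  # nothing to augment with
    query_mean = mean_vector([toy_embed(q) for q in synthetic_queries])
    mixed = [doc_weight * d + (1.0 - doc_weight) * q
             for d, q in zip(doc_vec, query_mean)]
    norm = math.sqrt(sum(v * v for v in mixed))
    return [v / norm for v in mixed] if norm > 0 else mixed


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


if __name__ == "__main__":
    passage = "mitochondria produce atp through oxidative phosphorylation"
    # In the paper these would be generated by the instruction-tuned LLM.
    synthetic = ["what is the powerhouse of the cell"]
    user_query = "what is the powerhouse of the cell"

    plain = cosine(toy_embed(user_query), toy_embed(passage))
    augmented = cosine(toy_embed(user_query),
                       augment_doc_embedding(passage, synthetic))
    print("plain score:", plain)
    print("augmented score:", augmented)
```

The sketch only shows the mechanics: when a user query resembles one of the synthetic queries attached to a passage, the augmented passage vector moves toward that query in embedding space. How much this helps in practice depends on the encoder and the quality of the generated queries, which the toy encoder here does not model.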

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Natural Language Processing Techniques

Methods: Flan-T5 · ALIGN