Syntriever: How to Train Your Retriever with Synthetic Data from LLMs

Minsang Kim; Seungjun Baek

arXiv:2502.03824·cs.CL·February 17, 2025

TL;DR

Syntriever is a training framework for retrievers that uses synthetic data generated by black-box LLMs, improving retrieval performance without requiring access to LLM output probabilities.

Contribution

The paper presents a two-stage training process, synthetic data generation (distillation) followed by preference alignment, that makes it possible to train a retriever using only black-box LLMs.

Findings

01. Achieves state-of-the-art results on multiple benchmark datasets.

02. Synthetic data generated by the LLM is effective for training retrievers.

03. Preference alignment improves retrieval relevance.

Abstract

LLMs have boosted progress in many AI applications. Recently, there have been attempts to distill the vast knowledge of LLMs into information retrieval systems. Those distillation methods mostly use the output probabilities of LLMs, which are unavailable in the latest black-box LLMs. We propose Syntriever, a training framework for retrievers using synthetic data from black-box LLMs. Syntriever consists of two stages. First, in the distillation stage, we synthesize relevant and plausibly irrelevant passages and augmented queries using chain-of-thought for the given queries. The LLM is asked to self-verify the synthetic data for possible hallucinations, after which retrievers are trained with a loss designed to cluster the embeddings of relevant passages. Second, in the alignment stage, we align the retriever with the preferences of LLMs. We propose a preference modeling called partial Plackett-Luce…
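The abstract is cut off before it describes the partial Plackett-Luce model, so the paper's own formulation is not reproduced here. As background only, the sketch below computes the negative log-likelihood of a ranking under the standard (full) Plackett-Luce model, which assigns a ranking the product of softmax probabilities of choosing each item first among the items not yet placed. The function names, scores, and rankings are hypothetical and for illustration; this is not the loss used by Syntriever.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def plackett_luce_nll(scores: Sequence[float], ranking: Sequence[int]) -> float:
    """Negative log-likelihood of `ranking` under the standard Plackett-Luce model.

    scores  : one similarity score per candidate passage (hypothetical values)
    ranking : candidate indices ordered from most to least preferred
    """
    ordered = [scores[i] for i in ranking]
    nll = 0.0
    for k in range(len(ordered)):
        # Log-probability of choosing item k first among the items not yet placed.
        nll -= ordered[k] - logsumexp(ordered[k:])
    return nll


# Hypothetical example: three passages scored by a retriever,
# compared against two possible preference orders.
scores = [2.0, 0.5, -1.0]
agree = plackett_luce_nll(scores, [0, 1, 2])     # order matches the scores
disagree = plackett_luce_nll(scores, [2, 1, 0])  # order reverses the scores
```

In this sketch a ranking that agrees with the score order gets a lower loss than one that reverses it, which is the property a preference-alignment objective of this family relies on.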

Code & Models

Repositories

kmswin1/syntriever (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies

Methods: ALIGN