Bridging the Cold-Start Gap: LLM-Powered Synthetic Data Generation for Natural Language Search at Airbnb

Wendy Ran Wei; Hao Li; Weiwei Guo; Xiaowei Liu; Xueyin Chen; Dillon Davis; Malay Haldar; Soumyadip Banerjee; Kedar Bellare; Huiji Gao; Stephanie Moyerman; Sanjeev Katariya

arXiv:2605.21812·cs.IR·May 22, 2026

TL;DR

This paper introduces a framework that uses large language models to generate synthetic queries and labels, addressing the cold-start problem in Airbnb's natural language search and showing improved realism and diversity over baselines.

Contribution

The authors propose a seed-guided synthetic data generation method that uses LLMs to improve cold-start training and evaluation for natural language search systems.

Findings

01

The seed-guided approach achieves a query-length KL divergence of 0.66, outperforming the InPars baseline.

02

The approach achieves the lowest attribute-type KL divergence (0.04), lower than that of the seed queries.

03

Synthetic data produces harder evaluation examples, improving model discriminative ability.
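
The findings above report KL divergence over query length and attribute type. As a rough illustration of how such a number could be computed for query length, here is a minimal sketch. It assumes whitespace tokenization, a capped length histogram, additive smoothing, and the direction KL(reference || synthetic); none of these choices are stated on this page, and the function names `length_histogram` and `kl_divergence` are hypothetical.

```python
import math
from collections import Counter


def length_histogram(queries, max_len=30):
    """Empirical distribution over whitespace-token counts, capped at max_len.

    Assumption: the page does not say how query length is measured;
    whitespace tokens are used here purely for illustration.
    """
    counts = Counter(min(len(q.split()), max_len) for q in queries)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats over the union of both supports.

    Both distributions get eps added to every bin and are renormalized,
    so bins missing from q do not produce a division by zero.
    """
    support = set(p) | set(q)
    p_s = {k: p.get(k, 0.0) + eps for k in support}
    q_s = {k: q.get(k, 0.0) + eps for k in support}
    z_p = sum(p_s.values())
    z_q = sum(q_s.values())
    return sum(
        (p_s[k] / z_p) * math.log((p_s[k] / z_p) / (q_s[k] / z_q))
        for k in support
    )


# Toy data, invented for this example only.
reference = ["beach house with pool", "cabin near lake", "pet friendly apartment downtown"]
synthetic = ["house", "cabin near lake", "quiet pet friendly apartment close to downtown with parking"]

p = length_histogram(reference)
q = length_histogram(synthetic)
print(kl_divergence(p, q))
```

A lower value means the synthetic length distribution sits closer to the reference one. The same `kl_divergence` helper would apply to attribute-type frequencies if those were supplied as a dictionary of probabilities.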

Abstract

Deploying natural language search systems presents a critical cold-start challenge: no real user queries to learn linguistic patterns, and no relevance labels to train ranking models. We present a framework for generating synthetic queries and labels using large language models (LLMs), powering model training and evaluation for Airbnb's natural language search. For query generation, we combine contrastive listing pairs from booking sessions with seed queries from user research to balance realism and diversity, enabling a cold-to-warm start transition as real user data becomes available. For label generation, we introduce contrastive generation that produces topicality labels by construction, and Virtual Judge (VJ) labeling for broader coverage. We compare our approach against a no-seed contrastive baseline and an InPars-style baseline. For query length, the InPars baseline produces…
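
The abstract describes generating queries from contrastive listing pairs combined with seed queries, with topicality labels obtained "by construction". The sketch below shows one way that idea might be wired together; it is an interpretation, not the authors' implementation. The `Listing` schema, the prompt wording, and the helpers `contrastive_attributes`, `build_prompt`, and `labels_by_construction` are all hypothetical, and the LLM call itself is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Listing:
    # Hypothetical schema: an id plus a flat attribute dictionary.
    listing_id: str
    attributes: dict = field(default_factory=dict)


def contrastive_attributes(target, other):
    """Attributes of `target` whose values differ from, or are absent on, `other`."""
    return {
        key: value
        for key, value in target.attributes.items()
        if other.attributes.get(key) != value
    }


def build_prompt(target, other, seed_queries):
    """Assemble an illustrative prompt; the wording is an assumption."""
    diff = contrastive_attributes(target, other)
    diff_text = ", ".join(f"{k}={v}" for k, v in sorted(diff.items()))
    seeds_text = "\n".join(f"- {s}" for s in seed_queries)
    return (
        "Write a natural language search query that listing A satisfies "
        "and listing B does not.\n"
        f"Distinguishing attributes of A: {diff_text}\n"
        "Match the style of these example queries:\n"
        f"{seeds_text}"
    )


def labels_by_construction(query, target, other):
    """The query was written to fit `target` and not `other`,
    so the labels follow from how it was generated (1 = on topic, 0 = off topic)."""
    return [(query, target.listing_id, 1), (query, other.listing_id, 0)]


# Toy listings and seeds, invented for this example only.
a = Listing("A", {"pool": True, "pets": True, "city": "Austin"})
b = Listing("B", {"pool": False, "pets": True, "city": "Austin"})
prompt = build_prompt(a, b, ["somewhere quiet with a hot tub"])
generated_query = "place in Austin with a pool"  # stand-in for an LLM response
labels = labels_by_construction(generated_query, a, b)
```

In this reading, the seed queries steer style while the listing pair supplies content, and no separate judging step is needed for the two listings in the pair.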
