Fine-tuning Small Language Models as Efficient Enterprise Search Relevance Labelers

Yue Kang; Zhuoyi Huang; Benji Schussheim; Diana Licon; Dina Atia; Shixing Cao; Jacob Danovitch; Kunho Kim; Billy Norcilien; Jonah Karpman; Mahmound Sayed; Mike Taylor; Tao Sun; Pavel Metrikov; Vipul Agarwal; Chris Quirk; Ye-Yi Wang; Nick Craswell; Irene Shaffer; Tianwei Chen; Sulaiman Vesal; Soundar Srinivasan

arXiv:2601.03211·cs.IR·January 7, 2026

Open Access

TL;DR

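This paper presents a method for fine-tuning small language models (SLMs) as enterprise search relevance labelers using synthetic data generation and teacher-student distillation. The resulting labels match or surpass the teacher LLM's accuracy, with a 17x throughput increase and a 19x cost reduction compared to large models.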

Contribution

It introduces a novel approach combining synthetic data, teacher-student distillation, and fine-tuning of small models for scalable enterprise relevance labeling.

Findings

01. SLM relevance labels match or surpass teacher LLM accuracy

02. 17x throughput increase in labeling process

03. 19x cost reduction compared to large models

Abstract

In enterprise search, building high-quality datasets at scale remains a central challenge because labeled data is difficult to acquire. To address this challenge, we propose an efficient approach to fine-tune small language models (SLMs) for accurate relevance labeling, enabling high-throughput, domain-specific labeling whose quality is comparable to, or even better than, that of state-of-the-art large language models (LLMs). To overcome the lack of high-quality and accessible datasets in the enterprise domain, our method leverages synthetic data generation. Specifically, we employ an LLM to synthesize realistic enterprise queries from a seed document, apply BM25 to retrieve hard negatives, and use a teacher LLM to assign relevance scores. The resulting dataset is then distilled into an SLM, producing a compact relevance labeler. We evaluate our approach on a high-quality benchmark consisting…
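The hard-negative step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes whitespace tokenization, a common Okapi BM25 variant with default-style parameters (k1 = 1.5, b = 0.75), and a toy corpus invented for illustration. The function names `bm25_scores` and `mine_hard_negatives` are hypothetical. In the pipeline the abstract describes, the query would be synthesized by an LLM from the seed document, and a teacher LLM would then score the mined candidates; both steps are outside this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real system would do more.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a common Okapi BM25 variant."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def mine_hard_negatives(query, docs, seed_index, top_k=2):
    """Return indices of the top-scoring non-seed documents with lexical overlap."""
    scores = bm25_scores(query, docs)
    candidates = [i for i in range(len(docs)) if i != seed_index and scores[i] > 0]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:top_k]


# Toy corpus (invented for illustration). Document 0 is the seed document
# from which the query is assumed to have been synthesized.
docs = [
    "quarterly expense report submission policy for employees",
    "expense report template for travel reimbursement",
    "employee onboarding checklist and laptop setup",
    "policy for submitting quarterly sales forecasts",
]
query = "how to submit quarterly expense report"

hard_negatives = mine_hard_negatives(query, docs, seed_index=0, top_k=2)
print(hard_negatives)
```

Documents that share query terms with the seed but are not the seed itself make useful hard negatives because they look lexically relevant; whether they are actually relevant is what the teacher LLM's score would decide.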

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications