Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search

Kayhan Behdin; Qingquan Song; Sriram Vasudevan; Jian Sheng; Xiaojing Ma; Z Zhou; Chuanrui Zhu; Guoyao Li; Chanh Nguyen; Sayan Ghosh; Hejian Sang; Ata Fatahi Baarzi; Sundara Raman Ramachandran; Xiaoqing Wang; Qing Lan; Vinay Y S; Qi Guo; Caleb Johnson; Zhipeng Wang; Fedor Borisyuk

arXiv:2510.22101·cs.IR·October 28, 2025

TL;DR

This paper presents methods for compressing and optimizing small language models to enable efficient, large-scale semantic search deployment at LinkedIn, significantly reducing costs and latency while maintaining accuracy.

Contribution

It introduces model and context compression techniques, along with deployment optimizations, to improve the efficiency of small language models in real-world semantic search applications.

Findings

01. Model size reduced by up to 40% with pruning.

02. Input context length reduced by up to 10x with minimal accuracy loss.

03. System throughput increased by 10x on GPU deployment.

Abstract

Large Language Models (LLMs) have demonstrated impressive quality when applied to predictive tasks such as relevance ranking and semantic search. However, deploying such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based, decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. In particular, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to 40% while maintaining accuracy. Additionally, we present context compression techniques that allow us to reduce the input context length by up to 10x with minimal loss of accuracy. Finally, we present practical lessons from optimizing the serving infrastructure for deploying such a system on GPUs at…
