Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
Xuanbo Su, Wenhao Hu, Haibo Su, Yunzhang Chen, Le Zhan, Yanqi Yang, Leo Huang

TL;DR
This paper introduces SalesLLM, a comprehensive bilingual benchmark for evaluating large language models in realistic sales dialogue scenarios, including automatic evaluation methods and a new user simulation model.
Contribution
It presents a new benchmark with a large dataset, automatic evaluation pipeline, and a user model to better assess LLMs in sales tasks, addressing limitations of existing benchmarks.
Findings
SalesLLM scores strongly correlate with human ratings (Pearson r=0.98).
Top LLMs perform comparably to humans in sales dialogues.
Less capable LLMs underperform compared to humans.
Abstract
Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a challenging setting for large language models (LLMs). Yet existing dialogue benchmarks rarely measure deal progression and outcomes. We introduce SalesLLM benchmark, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress,and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, we train a user model, CustomerLM, with SFT and DPO on 8,000+ crowdworker-involved sales conversations, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
