Benchmarking and Learning Real-World Customer Service Dialogue
Tianhong Gao, Jundong Shen, Jiapeng Wang, Bei Shi, Ying Ju, Junfeng Yao, Huiyu Yu

TL;DR
This paper introduces OlaBench, a customer service dialogue benchmark, and OlaMind, a reinforcement learning approach that improves large language models on real-world industrial intelligent customer service (ICS) tasks, narrowing the gap between offline metrics and deployment success.
Contribution
The paper presents OlaBench for realistic ICS evaluation and OlaMind for reinforcement learning-based model improvement, addressing gaps in existing benchmarks and training pipelines.
Findings
OlaMind outperforms GPT-5.2 and Gemini 3 Pro on OlaBench.
OlaMind achieves a +23.67% gain in issue resolution in online tests.
OlaBench evaluates service capability, safety, and latency sensitivity.
Abstract
Existing benchmarks and training pipelines for industrial intelligent customer service (ICS) remain misaligned with real-world dialogue requirements, overemphasizing verifiable task success while under-measuring subjective service quality and realistic failure modes, leaving a gap between offline gains and deployable dialogue behavior. We close this gap with a benchmark-to-optimization loop: we first introduce OlaBench, an ICS benchmark spanning retrieval-augmented generation, workflow-based systems, and agentic settings, which evaluates service capability, safety, and latency sensitivity; moreover, motivated by OlaBench results showing state-of-the-art LLMs still fall short, we propose OlaMind, which distills reusable reasoning patterns and service strategies from expert dialogues and applies rubric-aware staged exploration-exploitation reinforcement learning to improve model…
