CXMArena: Unified Dataset to benchmark performance in realistic CXM Scenarios
Raghav Garg, Kapil Sharma, Karan Gupta

TL;DR
CXMArena is a large-scale synthetic benchmark dataset designed to evaluate AI performance in realistic customer experience management scenarios, addressing current limitations in data scarcity and benchmark realism.
Contribution
The paper introduces CXMArena, a scalable LLM-powered pipeline for creating realistic CXM datasets and benchmarks across five operational tasks, filling a critical gap in practical evaluation tools.
Findings
State-of-the-art models achieve only 68% accuracy in article search.
Knowledge base refinement models have a low F1 score of 0.3.
Benchmark difficulty highlights the need for advanced models and solutions.
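As a rough illustration of the two metrics quoted above, the sketch below computes top-1 accuracy for article search and an F1 score over flagged article pairs for knowledge base refinement. The function names, the top-1 formulation, and the treatment of KB issues as unordered article pairs are assumptions made for this example; the summary does not state the paper's exact metric definitions.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def top1_accuracy(predicted: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    """Fraction of queries whose top retrieved article matches the gold article.

    Assumption: one gold article per query (hypothetical setup).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


def pair_f1(
    predicted_pairs: Iterable[Tuple[Hashable, Hashable]],
    gold_pairs: Iterable[Tuple[Hashable, Hashable]],
) -> float:
    """F1 over unordered article pairs flagged as redundant or contradictory.

    Assumption: (a, b) and (b, a) denote the same pair.
    """
    pred = {frozenset(p) for p in predicted_pairs}
    gold = {frozenset(p) for p in gold_pairs}
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data, not taken from CXMArena.
    print(top1_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))
    print(pair_f1([("a1", "a2"), ("a3", "a4")],
                  [("a2", "a1"), ("a5", "a6"), ("a7", "a8")]))
```

Treating pairs as unordered keeps a system from being penalized for reporting the same two articles in a different order, which seems a reasonable default when the issue (redundancy or contradiction) is symmetric.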
Abstract
Large Language Models (LLMs) hold immense potential for revolutionizing Customer Experience Management (CXM), particularly in contact center operations. However, evaluating their practical utility in complex operational environments is hindered by data scarcity (due to privacy concerns) and the limitations of current benchmarks. Existing benchmarks often lack realism, failing to incorporate deep knowledge base (KB) integration, real-world noise, or critical operational tasks beyond conversational fluency. To bridge this gap, we introduce CXMArena, a novel, large-scale synthetic benchmark dataset specifically designed for evaluating AI in operational CXM contexts. Given the diversity in possible contact center features, we have developed a scalable LLM-powered pipeline that simulates the brand's CXM entities that form the foundation of our datasets, such as knowledge articles including…
Peer Reviews
Decision·Submitted to ICLR 2026
1. Comprehensive benchmark covering core CXM tasks beyond the usual fluency. CXM tasks usually have real-world use cases, so the manuscript is useful. 2. Synthetic yet realistic data with strong alignment to real-world metrics. The realistic nature of this data shows very strong alignment and large-scale value across in-depth metrics. 3. Cross-domain and multilingual support (English, French, German). The cross-domain and multilingual settings are used well here. 4. Provides baseline results f
1. Because the manuscript relies on synthetic data, it may miss subtle nuances of real human behavior. Pairing the LLM judge with a human judge would be worth exploring and might strengthen the findings. 2. Dependent on the biases and limitations of the LLMs used for generation. 3. Currently focused on one domain with limited real-world diversity. Multi-domain alignment could also be helpful.
The authors bring forth a very important problem, one that practitioners constantly face. One very important aspect of public datasets and papers is that they very rarely apply to industry situations, and the authors point this out very well. They also do a comprehensive study of the different challenging situations in Customer Service via LLMs in Section 2 and bring out the limitations of existing datasets very well.
The authors highlight that most public datasets have limitations. This is a very valid concern, and they have given significant citations establishing the limitations of existing datasets in Section 2. What is not clear is how this is alleviated in their work. Looking at one example, they mention "We simulate real-world data quality issues by introducing controlled redundant and contradictory information from one article to another, creating data for developing KB maintenance techniques" - it is n
- Real-world distribution because of controlled noise injection (simulated ASR errors, interaction fragments) from SMEs and rigorous automated validation. - Authors introduce five tasks: Knowledge Base Refinement, Intent Prediction, Agent Quality Adherence, Article Search, and Multi-turn RAG with Integrated Tools. - Pipeline applied to different domains and languages. - The authors introduce a pipeline to synthetically generate the knowledge base specific to a fictional brand and then uses th
- My main concern is the correctness of the synthetic data generated using an LLM (Gemini in this case) and the LLM-as-a-judge evaluation without a human in the loop. - A contradiction detection baseline would be very insightful, since this is one of the tasks needed for Knowledge Base refinement. - Do you generate all of this dataset synthetically given a seed prompt about the brand name and its type? I'm not convinced how you can ensure high-fidelity data since there is no human in the loop. How do yo
