Assessing the Performance of Human-Capable LLMs -- Are LLMs Coming for Your Job?
John Mavi, Nathan Summers, Sergio Coronado

TL;DR
This paper introduces SelfScore, a benchmark for evaluating LLM agents on help desk tasks. It reports that the automated agents outperformed a human control group, which raises concerns about AI-driven job displacement.
Contribution
The paper develops SelfScore, a new benchmark for assessing LLM agents on professional help desk tasks, and shows that agents enhanced with Retrieval-Augmented Generation (RAG) outperform both agents without RAG and the human control group.
Findings
All automated LLM agents tested outperformed the human control group on help desk tasks.
Retrieval-Augmented Generation improves domain-specific LLM performance.
SelfScore enables transparent comparison of AI and human agents.
Abstract
This paper presents the development and validation of SelfScore, a novel benchmark designed to assess the performance of automated Large Language Model (LLM) agents on help desk and professional consultation tasks. Given the increasing integration of AI across industries, particularly in customer service, SelfScore fills a crucial gap by enabling direct comparison of automated agents and human workers. The benchmark evaluates agents on problem complexity and response helpfulness, with a scoring system designed to be transparent and simple. The study also develops automated LLM agents to validate SelfScore and explores the benefits of Retrieval-Augmented Generation (RAG) for domain-specific tasks, demonstrating that automated LLM agents incorporating RAG outperform those without. All automated LLM agents were observed to perform better than the human control group. Given these results,…
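As a rough illustration of the retrieval step that RAG adds before generation, the sketch below ranks help desk articles against a question and prepends the best match to the prompt. It is a minimal sketch under stated assumptions, not the paper's implementation: the knowledge base entries, the function names (`tokenize`, `retrieve`, `build_prompt`), and the bag-of-words cosine similarity are all hypothetical stand-ins, and a real agent would typically use learned embeddings and pass the prompt to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical help desk knowledge base (illustrative entries, not from the paper).
KNOWLEDGE_BASE = [
    "To reset a password, open Settings, choose Security, and select Reset Password.",
    "Refund requests are processed within five business days of approval.",
    "VPN connection errors are often resolved by restarting the network adapter.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Assemble a prompt that places retrieved context before the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    print(build_prompt("How do I reset my password?", KNOWLEDGE_BASE))
```

An agent without RAG would send only the question to the model; the retrieved context is what supplies the domain-specific detail.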
Taxonomy
Topics: Private Equity and Venture Capital · Entrepreneurship Studies and Influences · Genetics, Bioinformatics, and Biomedical Research
Methods: Attention Is All You Need · Linear Layer · Dropout · Dense Connections · Layer Normalization · Residual Connection · Weight Decay · Byte Pair Encoding
