SYNTHEMPATHY: A Scalable Empathy Corpus Generated Using LLMs Without Any Crowdsourcing
Run Chen, Jun Shin, Julia Hirschberg

TL;DR
This paper introduces SYNTHEMPATHY, a large-scale empathetic dialogue corpus generated entirely by LLMs, enabling scalable development of empathetic language models without crowdsourcing.
Contribution
It presents a novel framework for creating a large empathetic dialogue dataset using LLMs, bypassing the need for costly crowdsourcing.
Findings
Fine-tuning a base Mistral 7B model on SYNTHEMPATHY increases its average empathy score.
The corpus contains 105,000 empathetic responses to real-life situations.
The approach demonstrates scalable data generation for empathetic dialogue modeling.
Abstract
Previous research has shown that humans are more receptive towards language models that exhibit empathetic behavior. While empathy is essential for developing helpful dialogue agents, very few large corpora containing empathetic dialogues are available for fine-tuning LLMs. The few existing corpora have largely relied on crowdsourcing to simulate empathetic conversations, a process that is expensive, time-consuming, and not scalable to larger datasets. We propose a data generation framework for developing SYNTHEMPATHY, a large corpus containing 105k empathetic responses to real-life situations compiled through LLM generation. A base Mistral 7B model fine-tuned on our SYNTHEMPATHY corpus exhibits an increase in the average empathy score.
Taxonomy
Topics: Natural Language Processing Techniques · Wikis in Education and Collaboration · Semantic Web and Ontologies
Methods: Balanced Selection
