Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups
Salar Hashemitaheri, Ian Harris

TL;DR
This paper presents a data augmentation framework that combines synthetic and real data to improve intent classification in a conversational agent for smoking cessation support groups, reporting a 32% improvement in F1 score after augmentation.
Contribution
It introduces a two-level data augmentation strategy that uses GPT-generated synthetic posts and scraped real posts to improve classifier accuracy in a domain with limited data.
Findings
32% improvement in F1 score after augmentation
Synthetic and real data augmentation yield similar performance gains
High-quality synthetic data generated with 87% human-validated relevance
Abstract
Online support groups for smoking cessation are economical and accessible, yet they often face challenges with low user engagement and stigma. The use of an automatic conversational agent would improve engagement by ensuring that all user comments receive a timely response. We address the challenge of insufficient high-quality data by employing a two-level data augmentation strategy: synthetic data augmentation and real data augmentation. First, we fine-tune an open-source LLM to classify posts from our existing smoking cessation support groups and identify intents with low F1 scores (F1 being the harmonic mean of precision and recall). Then, for these intents, we generate additional synthetic data using prompt engineering with the GPT model, with an average of 87% of the generated synthetic posts deemed high quality by human annotators. Overall, the synthetic augmentation process resulted in 43% of the original…
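The step of identifying intents with low F1 scores can be illustrated with a minimal sketch. It is not the authors' code: the function names, the intent labels, and the 0.7 threshold are all hypothetical, and the abstract does not state which cutoff was used. F1 is computed per intent as the harmonic mean of precision and recall.

```python
from collections import Counter


def per_intent_f1(gold, predicted):
    """Return a dict mapping each intent label to its F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted intent p, but it was wrong
            fn[g] += 1  # missed the true intent g

    scores = {}
    for intent in set(gold) | set(predicted):
        pred_total = tp[intent] + fp[intent]
        gold_total = tp[intent] + fn[intent]
        precision = tp[intent] / pred_total if pred_total else 0.0
        recall = tp[intent] / gold_total if gold_total else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores[intent] = 2 * precision * recall / denom if denom else 0.0
    return scores


def low_f1_intents(scores, threshold=0.7):
    """List intents whose F1 falls below a (hypothetical) threshold."""
    return sorted(i for i, s in scores.items() if s < threshold)


# Hypothetical labels, for illustration only.
gold = ["craving", "craving", "relapse", "milestone"]
predicted = ["craving", "relapse", "relapse", "milestone"]

scores = per_intent_f1(gold, predicted)
candidates = low_f1_intents(scores)
```

In this toy example `candidates` holds the intents that would be targeted for augmentation; in practice the scores would come from a held-out evaluation set for the fine-tuned classifier.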
Taxonomy
Topics: Smoking Behavior and Cessation · Digital Mental Health Interventions · Social Media in Health Education
