Large Language Models Perform on Par with Experts Identifying Mental Health Factors in Adolescent Online Forums
Isabelle Lorge, Dan W. Joyce, Andrey Kormilitzin

TL;DR
This study evaluates large language models' ability to identify mental health factors in adolescent online forum posts, showing that GPT-4 performs comparably to experts and that synthetic data improves annotation accuracy.
Contribution
It introduces a new dataset of adolescent Reddit posts annotated for mental health factors and compares LLM performance to expert psychiatrists in this domain.
Findings
GPT-4 matches expert inter-annotator agreement
Synthetic data enhances LLM annotation performance
Models struggle with negation and factuality issues
Abstract
Mental health in children and adolescents has been steadily deteriorating over the past few years. The recent advent of Large Language Models (LLMs) offers much hope for cost- and time-efficient scaling of monitoring and intervention, yet despite specifically prevalent issues such as school bullying and eating disorders, previous studies have not investigated performance in this domain or for open information extraction, where the set of answers is not predetermined. We create a new dataset of Reddit posts from adolescents aged 12-19, annotated by expert psychiatrists for the following categories: TRAUMA, PRECARITY, CONDITION, SYMPTOMS, SUICIDALITY and TREATMENT, and compare expert labels to annotations from two top-performing LLMs (GPT3.5 and GPT4). In addition, we create two synthetic datasets to assess whether LLMs perform better when annotating data as they generate it. We find GPT4…
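To make the comparison between expert labels and LLM annotations concrete, here is a minimal sketch of one common way to quantify agreement between two annotators. It is an illustration under stated assumptions, not the paper's evaluation code: the choice of Cohen's kappa, the function names `cohen_kappa` and `per_category_kappa`, and the toy posts are all hypothetical. It also reduces each annotation to the presence or absence of a category per post, which is simpler than the open information extraction setting the abstract describes.

```python
from collections import Counter

# Category names taken from the abstract.
CATEGORIES = ["TRAUMA", "PRECARITY", "CONDITION", "SYMPTOMS", "SUICIDALITY", "TREATMENT"]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        # Degenerate case: both annotators always use the same single label.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def per_category_kappa(expert, model):
    """Kappa per category, treating each category as present/absent per post.

    `expert` and `model` are lists of sets of category names, one set per post.
    """
    return {
        c: cohen_kappa([c in e for e in expert], [c in m for m in model])
        for c in CATEGORIES
    }


# Hypothetical toy annotations for four posts (not data from the paper).
expert = [{"TRAUMA"}, {"SYMPTOMS"}, set(), {"TRAUMA", "SYMPTOMS"}]
model = [{"TRAUMA"}, set(), set(), {"TRAUMA", "SYMPTOMS"}]
scores = per_category_kappa(expert, model)
```

In this toy example the two annotators agree fully on TRAUMA and disagree on one post for SYMPTOMS. The same function could be applied to a pair of human experts to obtain a baseline against which a model's agreement with an expert is compared.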
Taxonomy
Topics: Mental Health via Writing
Methods: Sparse Evolutionary Training
