Preprocessing Large-Scale Conversational Datasets: A Framework and Its Application to Behavioral Health Transcripts

Paz Mor Naim; Shiri Sadeh-Sharvit; Samuel Jefroykin; Eddie Silber; Dennis P Morrison; Ariel Goldstein

PMC · DOI: 10.2196/78082 · October 24, 2025

Open Access

TL;DR

This paper introduces a framework for preprocessing noisy conversational datasets, especially in behavioral health, using a mix of statistical analysis, human annotation, and large language models.

Contribution

The novel contribution is a hybrid framework combining feature extraction, human input, and LLMs to filter non-session transcripts in behavioral health data.

Findings

01

Approximately one-third of transcripts contained transcription errors, such as incomprehensible segments or speaker misattribution.

02

Zero-shot LLM prompting achieved substantial agreement with expert annotations (κ=0.71) when classifying transcripts as sessions or non-sessions.

03

A high speaking rate and a short duration were indicators of non-therapy conversations, such as answering machine messages.
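
As a rough illustration of how speaking rate and duration could be used to screen transcripts, here is a minimal Python sketch. It is not the authors' implementation: the `Transcript` fields, the function names, and both threshold values are hypothetical and chosen only for the example; real cutoffs would need to be derived from the dataset itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    """Hypothetical minimal record for one transcript."""
    transcript_id: str
    word_count: int
    duration_seconds: float


# Hypothetical thresholds for illustration only (not taken from the paper).
MIN_DURATION_SECONDS = 300.0
MAX_WORDS_PER_MINUTE = 220.0


def words_per_minute(t: Transcript) -> float:
    """Speaking rate in words per minute; 0.0 when duration is not positive."""
    if t.duration_seconds <= 0:
        return 0.0
    return t.word_count / (t.duration_seconds / 60.0)


def screening_reasons(
    t: Transcript,
    min_duration: float = MIN_DURATION_SECONDS,
    max_wpm: float = MAX_WORDS_PER_MINUTE,
) -> List[str]:
    """Return the reasons a transcript should be reviewed as a possible non-session."""
    reasons = []
    if t.duration_seconds < min_duration:
        reasons.append("short duration")
    if words_per_minute(t) > max_wpm:
        reasons.append("high speaking rate")
    return reasons


if __name__ == "__main__":
    voicemail = Transcript("example-1", word_count=120, duration_seconds=30.0)
    session = Transcript("example-2", word_count=6000, duration_seconds=3000.0)
    for t in (voicemail, session):
        print(t.transcript_id, screening_reasons(t))
```

Returning a list of reasons rather than a single boolean keeps the heuristic as a screening aid: flagged transcripts can be routed to human annotation or LLM classification instead of being discarded outright.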

Abstract

The rise of artificial intelligence and accessible audio equipment has led to a proliferation of datasets of recorded conversation transcripts across various fields. However, automatic mass recording and transcription often produce noisy, unstructured data that contain unintended recordings, such as hallway conversations or media (eg, TV, radio), as well as transcription inaccuracies such as speaker misattribution or misidentified words. As a result, large conversational transcript datasets require careful preprocessing and filtering to ensure their research utility. This challenge is particularly relevant in behavioral health contexts (eg, therapy, counseling), where deriving meaningful insights, particularly about dynamic processes, depends on accurate conversation representation. We present a framework for preprocessing large datasets of conversational transcripts and filtering out non-sessions: transcripts…

Taxonomy

Topics: Advanced Text Analysis Techniques · Topic Modeling · Misinformation and Its Impacts