Call2Instruct: Automated Pipeline for Generating Q&A Datasets from Call Center Recordings for LLM Fine-Tuning
Alex Echeverria, Sávio Salvarino Teles de Oliveira, Fernando Marques Federson

TL;DR
This paper introduces an automated end-to-end pipeline that transforms unstructured call center recordings into high-quality Q&A datasets, enabling effective fine-tuning of large language models for customer service applications.
Contribution
It presents a novel, fully automated pipeline for converting noisy call recordings into instructional Q&A datasets suitable for LLM fine-tuning, demonstrated with successful model training.
Findings
The generated dataset improved LLM fine-tuning performance.
The pipeline effectively handles noisy, unstructured audio data.
The code is publicly available for reproducibility.
Abstract
The adaptation of Large-Scale Language Models (LLMs) to specific domains depends on high-quality fine-tuning datasets, particularly in instructional format (e.g., Question-Answer - Q&A). However, generating these datasets, particularly from unstructured sources such as call center audio recordings, poses a significant challenge due to the noisy and disorganized nature of the data. This paper presents a solution to this challenge by offering an end-to-end automated pipeline for generating Q&A instructional datasets from such recordings. The methodology developed comprises sequential steps of audio processing (including diarization, noise removal and automatic transcription), textual processing (cleaning, normalization, and anonymization), semantic extraction of customer demands and attendant responses using vector embeddings, and matching via semantic search to form the final Q&A pairs.…
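The final matching step described in the abstract (embedding customer demands and attendant responses, then pairing them by semantic search) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the similarity threshold, and the sample sentences are hypothetical, and a toy bag-of-words vector stands in for whatever embedding model the paper actually uses.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (assumed stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_pairs(demands, responses, threshold=0.2):
    """Pair each demand with its most similar response, if above the (hypothetical) threshold."""
    response_vecs = [embed(r) for r in responses]
    pairs = []
    for demand in demands:
        d_vec = embed(demand)
        scores = [cosine(d_vec, r_vec) for r_vec in response_vecs]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append(
                {"question": demand, "answer": responses[best], "score": scores[best]}
            )
    return pairs


# Hypothetical, already cleaned and anonymized utterances.
demands = [
    "I want to cancel my internet plan",
    "My invoice shows a wrong amount",
]
responses = [
    "To fix the invoice amount we will issue a corrected invoice",
    "I can cancel your internet plan today if you confirm",
]
pairs = match_pairs(demands, responses)
for pair in pairs:
    print(pair["question"], "->", pair["answer"])
```

In a real setting the `embed` function would presumably be replaced by a dense sentence-embedding model and the linear scan by a vector index; the threshold lets low-similarity demands be dropped instead of forced into a poor Q&A pair.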
Taxonomy
Topics: Expert finding and Q&A systems · AI in Service Interactions · Topic Modeling
