Investigating Cost-Efficiency of LLM-Generated Training Data for Conversational Semantic Frame Analysis

Shiho Matta; Yin Jou Huang; Fei Cheng; Hirokazu Kiyomaru; Yugo Murawaki

arXiv:2410.06550·cs.CL·October 10, 2024

TL;DR

This paper explores how to balance cost and quality in training data for conversational semantic frame analysis by combining human-labeled and GPT-4-generated data, and how the best mix changes with the available budget.

Contribution

The paper examines how to split a fixed budget between human-labeled and LLM-generated training data so that the trained model performs as well as possible for the money spent.

Findings

01. Across a wide range of budgets, the best performance comes from combining human-labeled and LLM-generated data.

02. The lower the budget, the higher the preferable proportion of LLM-generated data.

03. Mixing data sources improves cost-efficiency across various budget levels.

Abstract

Recent studies have demonstrated that few-shot learning allows LLMs to generate training data for supervised models at a low cost. However, the quality of LLM-generated data may not entirely match that of human-labeled data. This raises a crucial question: how should one balance the trade-off between the higher quality but more expensive human data and the lower quality yet substantially cheaper LLM-generated data? In this paper, we synthesized training data for conversational semantic frame analysis using GPT-4 and examined how to allocate budgets optimally to achieve the best performance. Our experiments, conducted across various budget levels, reveal that optimal cost-efficiency is achieved by combining both human and LLM-generated data across a wide range of budget levels. Notably, as the budget decreases, a higher proportion of LLM-generated data becomes preferable.
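The budget trade-off described in the abstract can be illustrated with a minimal sketch. It only shows the arithmetic of splitting a fixed budget between the two data sources; the function names, the per-example costs, and the budget figure are all hypothetical and are not taken from the paper. Choosing the best split would additionally require training and evaluating a model on each candidate mix, which is not shown here.

```python
def split_budget(budget, llm_fraction, human_cost, llm_cost):
    """Return (n_human, n_llm): how many examples each source yields.

    budget       -- total budget, in arbitrary integer cost units (assumed)
    llm_fraction -- share of the budget spent on LLM-generated data, in [0, 1]
    human_cost   -- hypothetical cost of one human-labeled example
    llm_cost     -- hypothetical cost of one LLM-generated example
    """
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    if human_cost <= 0 or llm_cost <= 0:
        raise ValueError("per-example costs must be positive")
    llm_budget = budget * llm_fraction
    human_budget = budget - llm_budget
    # Only whole examples can be bought, so round down.
    n_human = int(human_budget // human_cost)
    n_llm = int(llm_budget // llm_cost)
    return n_human, n_llm


def candidate_allocations(budget, human_cost, llm_cost, steps=4):
    """List (llm_fraction, n_human, n_llm) for evenly spaced budget splits."""
    allocations = []
    for i in range(steps + 1):
        fraction = i / steps
        n_human, n_llm = split_budget(budget, fraction, human_cost, llm_cost)
        allocations.append((fraction, n_human, n_llm))
    return allocations


if __name__ == "__main__":
    # Illustrative numbers only: a human label costs 10x an LLM label.
    for fraction, n_human, n_llm in candidate_allocations(1000, 10, 1):
        print(f"LLM share {fraction:.2f}: {n_human} human, {n_llm} LLM examples")
```

Under these assumed costs, moving budget toward the cheaper source buys many more examples in total, which is the quantity-versus-quality tension the abstract describes.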

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings