Fill In The Gaps: Model Calibration and Generalization with Synthetic Data

Yang Ba, Michelle V. Mancenido, and Rong Pan

arXiv:2410.10864 · cs.CL · October 16, 2024

TL;DR

This paper introduces a novel calibration method using synthetic data, particularly leveraging large language models, to improve model accuracy and calibration without sacrificing generalizability across NLP tasks.

Contribution

The paper proposes a synthetic-data-based calibration approach that uses LLMs as data generators, derives an expected calibration error (ECE) bound under the PAC learning framework, and demonstrates accuracy and calibration improvements on real test data.

Findings

1. Up to 34% increase in accuracy
2. 33% decrease in expected calibration error
3. Effective across four NLP tasks

Abstract

As machine learning models continue to swiftly advance, calibrating their performance has become a major concern prior to practical and widespread implementation. Most existing calibration methods often negatively impact model accuracy due to the lack of diversity of validation data, resulting in reduced generalizability. To address this, we propose a calibration method that incorporates synthetic data without compromising accuracy. We derive the expected calibration error (ECE) bound using the Probably Approximately Correct (PAC) learning framework. Large language models (LLMs), known for their ability to mimic real data and generate text with mixed class labels, are utilized as a synthetic data generation strategy to lower the ECE bound and improve model accuracy on real test data. Additionally, we propose data generation mechanisms for efficient calibration. Testing our method on…
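For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of the commonly used binned estimator of expected calibration error. It is not taken from the paper: the function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are assumptions chosen for illustration, and the authors' exact estimator may differ.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin.

    Illustrative sketch only; equal-width bins and n_bins=10 are assumptions,
    not details confirmed by the paper.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Equal-width bin edges over [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    # Map each confidence to a right-closed bin (lo, hi]; values at or
    # below the first interior edge fall into bin 0.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        accuracy = correct[mask].mean()
        avg_confidence = confidences[mask].mean()
        # mask.mean() is the fraction of all samples that land in this bin.
        ece += mask.mean() * abs(accuracy - avg_confidence)
    return ece
```

As a quick check, a model that predicts with confidence 0.9 on four examples but gets only two right, `expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])`, yields an ECE of about 0.4, while predictions whose confidence matches their accuracy yield 0.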

Taxonomy

Topics: Machine Learning and Data Classification · Topic Modeling · Natural Language Processing Techniques