Efficient data selection employing Semantic Similarity-based Graph Structures for model training

Roxana Petcu; Subhadeep Maji

arXiv:2402.14888 · cs.LG · February 26, 2024 · 2 cites

Open Access

TL;DR

This paper presents SeSaME, a data sampling method for NLP that works from textual information alone, using semantic similarity-based graphs to estimate model performance without passing the data through a compute-heavy model.

Contribution

Introduces SeSaME, a semantic similarity-based graph approach to data sampling, demonstrated on low-resource automated speech recognition (ASR), that sorts incoming data points into speech recognition difficulty buckets.

Findings

01. 93% accuracy increase in ASR performance projection

02. 7% reduction in validation loss using the method

03. 7% WER reduction on difficult datasets

Abstract

Recent developments in natural language processing (NLP) have highlighted the need for substantial amounts of data for models to capture textual information accurately. This raises concerns regarding the computational resources and time required for training such models. This paper introduces Semantics for data SAliency in Model performance Estimation (SeSaME). It is an efficient data sampling mechanism solely based on textual information without passing the data through a compute-heavy model or other intensive pre-processing transformations. The application of this approach is demonstrated in the use case of low-resource automated speech recognition (ASR) models, which excessively rely on text-to-speech (TTS) calls when using augmented data. SeSaME learns to categorize new incoming data points into speech recognition difficulty buckets by employing semantic similarity-based graph…
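
The abstract is cut off above, so the following is only a minimal sketch of the general idea of assigning a new sentence to a difficulty bucket from its nearest neighbours in a similarity graph. It is not the authors' implementation: it swaps in a single round of similarity-weighted voting over the k most similar labelled points, and the function names, the bucket labels ("easy", "medium", "hard"), and the toy 2-D vectors standing in for sentence embeddings are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_neighbours(query, embeddings, k):
    """Return the k (similarity, index) pairs most similar to the query."""
    scored = sorted(
        ((cosine(query, emb), i) for i, emb in enumerate(embeddings)),
        reverse=True,
    )
    return scored[:k]


def predict_bucket(query, embeddings, buckets, k=3):
    """Pick the bucket with the largest similarity-weighted vote among neighbours."""
    votes = Counter()
    for sim, i in knn_neighbours(query, embeddings, k):
        votes[buckets[i]] += max(sim, 0.0)  # ignore negatively correlated neighbours
    return votes.most_common(1)[0][0]


# Toy labelled set (hypothetical): vectors stand in for text embeddings,
# labels stand in for observed speech recognition difficulty.
EMB = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.7, 0.7]]
BUCKETS = ["easy", "easy", "hard", "hard", "medium"]

if __name__ == "__main__":
    print(predict_bucket([0.95, 0.15], EMB, BUCKETS))
    print(predict_bucket([0.15, 0.95], EMB, BUCKETS))
```

The point of the sketch is that the prediction uses only text-side vectors and already-known labels, so no speech synthesis or recognition call is needed for the new sentence. A learned graph model would replace the fixed voting rule.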

Taxonomy

Topics: Graph Theory and Algorithms · Advanced Graph Neural Networks · Semantic Web and Ontologies