Efficient data selection employing Semantic Similarity-based Graph Structures for model training
Roxana Petcu, Subhadeep Maji

TL;DR
This paper presents SeSaME, a semantic similarity-based graph method for efficient data selection in NLP, significantly improving model performance estimation without heavy computation.
Contribution
Introduces SeSaME, a novel semantic similarity-based graph approach for data sampling that enhances speech recognition model training and performance prediction.
Findings
93% accuracy increase in ASR performance projection
7% reduction in validation loss using the method
7% reduction in word error rate (WER) on difficult datasets
Abstract
Recent developments in natural language processing (NLP) have highlighted the need for substantial amounts of data for models to capture textual information accurately. This raises concerns regarding the computational resources and time required for training such models. This paper introduces Semantics for data SAliency in Model performance Estimation (SeSaME). It is an efficient data sampling mechanism solely based on textual information without passing the data through a compute-heavy model or other intensive pre-processing transformations. The application of this approach is demonstrated in the use case of low-resource automated speech recognition (ASR) models, which excessively rely on text-to-speech (TTS) calls when using augmented data. SeSaME learns to categorize new incoming data points into speech recognition difficulty buckets by employing semantic similarity-based graph…
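The abstract is cut off before it describes the graph mechanism in detail, so the following is only a generic sketch of the idea it names: link texts by semantic similarity, then let known difficulty buckets spread to unlabeled neighbours. It assumes text embeddings are already available (here, toy 2-D vectors), builds a cosine k-nearest-neighbour graph, and uses plain label propagation as a stand-in for whatever model the paper actually trains. The function names `cosine_knn_graph` and `propagate_buckets`, the choice of `k`, and the data are all hypothetical and not taken from the paper.

```python
import numpy as np


def cosine_knn_graph(emb, k):
    """Symmetric 0/1 adjacency linking each row to its k most cosine-similar rows."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    n = len(emb)
    nearest = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # make the graph undirected


def propagate_buckets(adj, labels, n_buckets, n_iters=10):
    """Spread known bucket labels over the graph; labels == -1 means unknown."""
    known = labels >= 0
    onehot = np.zeros((len(labels), n_buckets))
    onehot[known, labels[known]] = 1.0
    scores = onehot.copy()
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0  # avoid division by zero for isolated nodes
    for _ in range(n_iters):
        scores = adj @ scores / deg      # average the neighbours' scores
        scores[known] = onehot[known]    # keep observed labels fixed
    return scores.argmax(axis=1)


# Toy embeddings: two groups of semantically similar texts (assumed data).
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
# Bucket 0 = "easy", bucket 1 = "hard"; only two points have a known bucket.
labels = np.array([0, -1, -1, 1, -1, -1])

buckets = propagate_buckets(cosine_knn_graph(emb, k=2), labels, n_buckets=2)
print(buckets)
```

In this toy setup each unlabeled point inherits the bucket of the labeled point in its own similarity cluster, without any speech model being run on it, which is the kind of saving the abstract is pointing at.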
Taxonomy
Topics: Graph Theory and Algorithms · Advanced Graph Neural Networks · Semantic Web and Ontologies
