SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation
Maying Shen, Nadine Chang, Sifei Liu, Jose M. Alvarez

TL;DR
This paper introduces SSE, a framework that selects a semantically diverse subset of large-scale industrial data and then enriches it from an unlabeled pool: selection maintains model performance with less training data, and enrichment improves it.
Contribution
The paper presents a novel semantic selection and enrichment framework that enhances data efficiency and model performance in industrial AI applications.
Findings
Semantic selection maintains model performance with less data.
Semantic enrichment improves model performance without increasing dataset size.
Semantic diversity is crucial for optimal data selection.
Abstract
In recent years, the data collected for artificial intelligence has grown to an unmanageable amount. Particularly within industrial applications, such as autonomous vehicles, model training computation budgets are being exceeded while model performance is saturating -- and yet more data continues to pour in. To navigate the flood of data, we propose a framework to select the most semantically diverse and important dataset portion. Then, we further semantically enrich it by discovering meaningful new data from a massive unlabeled data pool. Importantly, we can provide explainability by leveraging foundation models to generate semantics for every data point. We quantitatively show that our Semantic Selection and Enrichment framework (SSE) can a) successfully maintain model performance with a smaller training dataset and b) improve model performance by enriching the smaller dataset without…
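To make the idea of selecting a "semantically diverse" subset concrete, here is a minimal sketch using greedy farthest-point sampling over embedding vectors. This is an illustrative stand-in and not the authors' algorithm: the abstract does not specify the selection procedure, and the function name `farthest_point_selection`, the Euclidean distance, and the toy 2-D embeddings are all assumptions made for the example.

```python
import math


def farthest_point_selection(embeddings, k, start=0):
    """Greedily pick k indices, each as far as possible from those already picked.

    Hypothetical helper for illustration; not taken from the SSE paper.
    """
    if not 0 < k <= len(embeddings):
        raise ValueError("k must be between 1 and len(embeddings)")
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(e, embeddings[start]) for e in embeddings]
    while len(selected) < k:
        # The next pick is the point least covered by the current selection.
        nxt = max(range(len(embeddings)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(e, embeddings[nxt]))
    return selected


# Toy "semantic embeddings": two tight clusters plus one isolated point.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
picked = farthest_point_selection(toy, k=3)
print(picked)
```

With a budget of three, the sketch picks one point from each region of the toy space instead of three near-duplicates from the first cluster, which is the behavior a diversity-driven selector is meant to have. In practice the embeddings would come from a foundation model, and the distance and budget would be chosen for the task.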
Taxonomy
Topics: Semantic Web and Ontologies · Advanced Computational Techniques and Applications · Time Series Analysis and Forecasting
