Curating Grounded Synthetic Data with Global Perspectives for Equitable AI
Elin Törnquist, Robert Alexander Caulk

TL;DR
This paper presents a method for creating diverse, culturally rich synthetic datasets from multilingual news sources to improve AI model robustness and generalizability, especially in data-scarce scenarios.
Contribution
We introduce a novel synthetic data generation approach grounded in real-world diversity, enhancing AI training data with multilingual, multicultural content for better generalization.
Findings
Up to 7.3% performance improvement on NER benchmarks
Synthetic data effectively captures global linguistic and cultural diversity
Methodology applicable to various AI domains for data diversification
Abstract
The development of robust AI models relies heavily on the quality and variety of training data available. In fields where data scarcity is prevalent, synthetic data generation offers a vital solution. In this paper, we introduce a novel approach to creating synthetic datasets, grounded in real-world diversity and enriched through strategic diversification. We synthesize data using a comprehensive collection of news articles spanning 12 languages and originating from 125 countries, to ensure a breadth of linguistic and cultural representations. Through enforced topic diversification, translation, and summarization, the resulting dataset accurately mirrors real-world complexities and addresses the issue of underrepresentation in traditional datasets. This methodology, applied initially to Named Entity Recognition (NER), serves as a model for numerous AI disciplines where data…
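The abstract names enforced topic diversification as one step of the pipeline but does not say how it is implemented. As a rough illustration only, the sketch below caps how many articles are kept per (topic, language) pair so that heavily covered combinations cannot dominate the sample. The function name `diversify`, the `per_group` parameter, and the `topic` and `language` fields are assumptions made for this example, not details taken from the paper.

```python
import random
from collections import defaultdict


def diversify(articles, per_group, seed=0):
    """Keep at most `per_group` articles for each (topic, language) pair.

    Hypothetical helper: the paper's actual diversification procedure
    is not described in the abstract.
    """
    rng = random.Random(seed)

    # Bucket articles by their (topic, language) combination.
    groups = defaultdict(list)
    for article in articles:
        groups[(article["topic"], article["language"])].append(article)

    # Draw a capped random subset from each bucket.
    selected = []
    for key in sorted(groups):
        pool = groups[key]
        rng.shuffle(pool)
        selected.extend(pool[:per_group])
    return selected


# Toy corpus (made-up records): one dominant group and two small ones.
articles = (
    [{"id": i, "topic": "politics", "language": "en"} for i in range(50)]
    + [{"id": 100 + i, "topic": "sports", "language": "sv"} for i in range(3)]
    + [{"id": 200 + i, "topic": "health", "language": "pt"} for i in range(2)]
)

sample = diversify(articles, per_group=3)
```

Capping per group is only one way to enforce diversity. It keeps every available article from rare combinations while down-sampling common ones, which is the balancing effect the abstract appears to be after.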
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Big Data Technologies and Applications
