Curating Grounded Synthetic Data with Global Perspectives for Equitable AI

Elin Törnquist; Robert Alexander Caulk

arXiv:2406.10258·cs.CL·June 19, 2024·2 cites

Open Access 4 Models

TL;DR

This paper presents a method for creating diverse, culturally rich synthetic datasets from multilingual news sources to improve AI model robustness and generalizability, especially in data-scarce scenarios.

Contribution

We introduce a novel synthetic data generation approach grounded in real-world diversity, enhancing AI training data with multilingual, multicultural content for better generalization.

Findings

01 Up to 7.3% performance improvement on NER benchmarks

02 Synthetic data effectively captures global linguistic and cultural diversity

03 Methodology applicable to various AI domains for data diversification

Abstract

The development of robust AI models relies heavily on the quality and variety of training data available. In fields where data scarcity is prevalent, synthetic data generation offers a vital solution. In this paper, we introduce a novel approach to creating synthetic datasets, grounded in real-world diversity and enriched through strategic diversification. We synthesize data using a comprehensive collection of news articles spanning 12 languages and originating from 125 countries, to ensure a breadth of linguistic and cultural representations. Through enforced topic diversification, translation, and summarization, the resulting dataset accurately mirrors real-world complexities and addresses the issue of underrepresentation in traditional datasets. This methodology, applied initially to Named Entity Recognition (NER), serves as a model for numerous AI disciplines where data…
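The abstract names enforced topic diversification as one step of the pipeline but does not say how it is carried out. As a rough illustration only, the sketch below caps how many articles are kept per (language, country, topic) bucket so that no single bucket dominates the sample. The field names `language`, `country`, and `topic`, the function `diversify_articles`, and the capping strategy itself are all assumptions made for this example and are not taken from the paper. The translation and summarization steps are left out because they depend on a model that the abstract does not specify.

```python
import random
from collections import defaultdict


def diversify_articles(articles, per_bucket, seed=0):
    """Keep at most `per_bucket` articles per (language, country, topic) bucket.

    Hypothetical helper: the record fields and the capping rule are
    illustrative assumptions, not the authors' method.
    """
    rng = random.Random(seed)

    # Group articles by their (language, country, topic) combination.
    buckets = defaultdict(list)
    for article in articles:
        key = (article["language"], article["country"], article["topic"])
        buckets[key].append(article)

    # Visit buckets in sorted order so the result is reproducible,
    # then take a random subset of bounded size from each bucket.
    selected = []
    for key in sorted(buckets):
        group = buckets[key]
        rng.shuffle(group)
        selected.extend(group[:per_bucket])
    return selected


if __name__ == "__main__":
    sample = [
        {"language": "en", "country": "US", "topic": "sports", "id": i}
        for i in range(5)
    ]
    sample.append({"language": "sv", "country": "SE", "topic": "politics", "id": 99})
    kept = diversify_articles(sample, per_bucket=2)
    print(len(kept), "articles kept")
```

Capping each bucket is only one simple way to enforce diversity. It keeps rare combinations intact (the single Swedish politics article survives) while trimming over-represented ones.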

Taxonomy

Topics: Context-Aware Activity Recognition Systems · Big Data Technologies and Applications