Formalising lexical and syntactic diversity for data sampling in French

Louis Estève; Manon Scholivet; Agata Savary

arXiv:2501.08003 · cs.CL · January 15, 2025

TL;DR

This paper introduces a heuristic for sampling French datasets to maximize lexical and syntactic diversity, examining their correlation and highlighting the complexity of effectively capturing dataset diversity.

Contribution

It presents a heuristic method to improve dataset diversity sampling and investigates the relationship between lexical and syntactic diversity in French datasets.

Findings

01. The heuristic significantly increases dataset diversity compared to random sampling.

02. Correlations between lexical and syntactic diversity vary across datasets and across versions of the diversity measures.

03. An arbitrarily chosen diversity measure may fail to capture diversity-related properties of a dataset.
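To make the third finding concrete, here is a minimal sketch of how two common diversity measures can rank the same pair of samples in opposite orders. Richness (number of distinct types) and Shannon entropy are illustrative choices made for this example; they are not necessarily the measures used in the paper, and the toy token lists are invented.

```python
import math
from collections import Counter

def richness(tokens):
    """Number of distinct types in the token list."""
    return len(set(tokens))

def shannon_entropy(tokens):
    """Shannon entropy (natural log) of the type frequency distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Toy samples (assumed data): A has more types but is dominated by one of them,
# B has fewer types but they are evenly distributed.
sample_a = ["le"] * 7 + ["chat", "chien", "oiseau"]
sample_b = ["le", "chat", "dort"] * 3

for name, sample in [("A", sample_a), ("B", sample_b)]:
    print(name, richness(sample), round(shannon_entropy(sample), 3))
```

Sample A scores higher on richness while sample B scores higher on entropy, so which sample counts as "more diverse" depends on the measure chosen.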

Abstract

Diversity is an important property of datasets, and sampling data for diversity is useful in dataset creation. Finding the optimally diverse sample is expensive; we therefore present a heuristic that significantly increases diversity relative to random sampling. We also explore whether different kinds of diversity, lexical and syntactic, correlate, with the purpose of sampling for expensive syntactic diversity through inexpensive lexical diversity. We find that correlations fluctuate with different datasets and versions of diversity measures. This shows that an arbitrarily chosen measure may fall short of capturing diversity-related properties of datasets.
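The abstract does not spell out the heuristic, so the following is only a sketch of one plausible approach, not the authors' method: a greedy sampler that repeatedly picks the sentence contributing the most unseen word types, compared against a random sample. The function names (`richness`, `greedy_sample`), the use of type count as the lexical diversity measure, and the toy corpus are all assumptions made for illustration.

```python
import random

def richness(sentences):
    """Number of distinct tokens (types) across a list of tokenised sentences."""
    return len({tok for sent in sentences for tok in sent})

def greedy_sample(sentences, k):
    """Pick up to k sentences, each time taking the one that adds the most unseen types.

    This is a generic greedy coverage heuristic (an assumption of this sketch),
    not the procedure described in the paper.
    """
    remaining = list(sentences)
    chosen, seen = [], set()
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: len(set(s) - seen))
        chosen.append(best)
        seen.update(best)
        remaining.remove(best)
    return chosen

# Toy French corpus (invented for the example), tokenised on whitespace.
corpus = [s.split() for s in [
    "le chat dort",
    "le chat mange",
    "le chien dort",
    "une souris grise traverse la cuisine",
    "le chat dort encore",
    "nous partons demain matin",
]]

random.seed(0)
random_pick = random.sample(corpus, 3)
greedy_pick = greedy_sample(corpus, 3)
print("random:", richness(random_pick), "greedy:", richness(greedy_pick))
```

Greedy selection costs one pass over the remaining sentences per pick, which is far cheaper than scoring every possible subset of size k. In general it offers no guarantee of finding the optimally diverse sample.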

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Linguistic Variation and Morphology