Formalising lexical and syntactic diversity for data sampling in French
Louis Estève, Manon Scholivet, Agata Savary

TL;DR
This paper introduces a heuristic for sampling French datasets to maximize lexical and syntactic diversity, examining their correlation and highlighting the complexity of effectively capturing dataset diversity.
Contribution
It presents a heuristic method to improve dataset diversity sampling and investigates the relationship between lexical and syntactic diversity in French datasets.
Findings
The heuristic significantly increases dataset diversity compared to random sampling.
Correlations between lexical and syntactic diversity vary across datasets and measures.
An arbitrarily chosen diversity measure may not fully capture the diversity-related properties of a dataset.
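To illustrate how a correlation between lexical and syntactic diversity could be checked across samples, here is a minimal sketch. It makes several assumptions that do not come from the paper: diversity is measured as Shannon entropy, lexical diversity is computed over word forms, syntactic diversity is approximated by entropy over hypothetical dependency labels, and the toy samples are invented for illustration. The paper's own measures may differ.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (natural log) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


# Hypothetical toy samples: each token is a (word form, dependency label) pair.
samples = [
    [("le", "det"), ("chat", "nsubj"), ("dort", "root")],
    [("le", "det"), ("le", "det"), ("le", "det")],
    [("un", "det"), ("chien", "nsubj"), ("mange", "root"),
     ("une", "det"), ("pomme", "obj")],
]

# One lexical and one syntactic diversity score per sample.
lexical = [shannon_entropy(form for form, _ in s) for s in samples]
syntactic = [shannon_entropy(label for _, label in s) for s in samples]

r = pearson(lexical, syntactic)
print(f"Pearson r between lexical and syntactic diversity: {r:.3f}")
```

Swapping `shannon_entropy` for another measure (for example, a plain count of distinct types) can change the resulting coefficient, which is the kind of sensitivity the findings describe.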
Abstract
Diversity is an important property of datasets, and sampling data for diversity is useful in dataset creation. Finding the optimally diverse sample is expensive; we therefore present a heuristic that significantly increases diversity relative to random sampling. We also explore whether different kinds of diversity (lexical and syntactic) correlate, with the aim of sampling for expensive syntactic diversity through inexpensive lexical diversity. We find that correlations fluctuate across datasets and across versions of the diversity measures. This shows that an arbitrarily chosen measure may fall short of capturing diversity-related properties of datasets.
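The abstract does not spell out the heuristic, so the following is only a sketch of one common way to sample for diversity without searching all subsets: a greedy selection that repeatedly adds the sentence that most increases the diversity of the sample so far. The choice of Shannon entropy over word forms as the diversity measure, the function names, and the toy sentences are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (natural log) of a Counter of token frequencies."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def greedy_sample(sentences, k):
    """Return indices of up to k sentences chosen greedily for entropy.

    At each step, the sentence whose tokens raise the entropy of the
    running sample the most is added. Ties go to the earliest sentence.
    This costs O(k * n) entropy evaluations instead of examining every
    size-k subset.
    """
    remaining = list(range(len(sentences)))
    chosen = []
    counts = Counter()
    for _ in range(min(k, len(sentences))):
        best_i, best_h = None, -1.0
        for i in remaining:
            h = entropy(counts + Counter(sentences[i]))
            if h > best_h:
                best_i, best_h = i, h
        chosen.append(best_i)
        counts.update(sentences[best_i])
        remaining.remove(best_i)
    return chosen


# Hypothetical toy corpus of tokenised French sentences.
corpus = [
    ["le", "chat"],
    ["le", "chat"],
    ["un", "chien"],
    ["le", "chat"],
]

picked = greedy_sample(corpus, 2)
print(picked)
```

A random sample of two sentences from this toy corpus would often pick two copies of the same sentence, whereas the greedy pass avoids a duplicate whenever a more diverse option remains. A greedy pass gives no guarantee of reaching the optimally diverse sample.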
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Linguistic Variation and Morphology
