The Effects of Data Size and Frequency Range on Distributional Semantic Models
Magnus Sahlgren, Alessandro Lenci

TL;DR
This study examines how data size and frequency range influence distributional semantic models, revealing that neural models struggle with small data and that the inverted factorized model is most reliable across conditions.
Contribution
The paper provides a comparative analysis of several distributional semantic models under varying data sizes and test-item frequency ranges, highlighting the robustness of the inverted factorized model.
Findings
Neural network models underperform with small datasets.
The inverted factorized model is the most reliable across different data sizes and frequency ranges.
Model performance varies with both the size of the training data and the frequency range of the test items.
Abstract
This paper investigates the effects of data size and frequency range on distributional semantic models. We compare the performance of a number of representative models for several test settings over data of varying sizes, and over test items of varying frequency. Our results show that neural network-based models underperform when the data is small, and that the most reliable model over data of varying sizes and frequency ranges is the inverted factorized model.
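As a rough illustration of the experimental setup the abstract describes, the sketch below shows one way to prepare corpora of different sizes and to group test items by how often they occur in the corpus. It is not the authors' code: the function names `frequency_bands` and `truncate_corpus`, the band labels, and the cut-off values in `bounds` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def frequency_bands(tokens, test_items, bounds=(10, 100)):
    """Group test items into bands by their frequency in `tokens`.

    `bounds` holds illustrative cut-offs (assumed here, not taken from
    the paper): items with 1..bounds[0] occurrences are "low", items up
    to bounds[1] are "medium", and anything above is "high".
    """
    counts = Counter(tokens)
    low_max, mid_max = bounds
    bands = {"unseen": [], "low": [], "medium": [], "high": []}
    for item in test_items:
        freq = counts[item]  # Counter returns 0 for missing keys
        if freq == 0:
            bands["unseen"].append(item)
        elif freq <= low_max:
            bands["low"].append(item)
        elif freq <= mid_max:
            bands["medium"].append(item)
        else:
            bands["high"].append(item)
    return bands


def truncate_corpus(tokens, sizes):
    """Return one prefix of the token stream per requested size."""
    return {n: tokens[:n] for n in sizes}


# Toy usage with a tiny token stream and deliberately small cut-offs.
tokens = ["the"] * 5 + ["cat"] * 2 + ["sat"]
bands = frequency_bands(tokens, ["the", "cat", "sat", "dog"], bounds=(1, 2))
subsets = truncate_corpus(tokens, [3, 8])
```

Each model under comparison would then be trained on every subset and scored separately on each band, so that the effect of data size and the effect of item frequency can be read off independently.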
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
