Semantic sentence similarity: size does not always matter

Danny Merkx; Stefan L. Frank; Mirjam Ernestus

arXiv:2106.08648 · cs.CL · March 31, 2022

TL;DR

This paper shows that visually grounded speech recognition (VGS) models can learn sentence semantics from a small image-caption database, indicating that database size is not all that matters and that paraphrasing (multiple captions per image) is a valuable learning signal.

Contribution

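It shows that a VGS model trained on a small image-caption database can outperform models trained on much larger databases at capturing sentence semantics, and that multiple captions per image help even when the total number of images is lower.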

Findings

01. Smaller datasets can produce embeddings correlating well with human judgments.

02. Multiple captions per image improve semantic learning even with fewer images.

03. Dataset quality and paraphrasing are crucial, not just size.

Abstract

This study addresses the question of whether visually grounded speech recognition (VGS) models learn to capture sentence semantics without access to any prior linguistic knowledge. We produce synthetic and natural spoken versions of a well-known semantic textual similarity database and show that our VGS model produces embeddings that correlate well with human semantic similarity judgements. Our results show that a model trained on a small image-caption database outperforms two models trained on much larger databases, indicating that database size is not all that matters. We also investigate the importance of having multiple captions per image and find that this is indeed helpful even if the total number of images is lower, suggesting that paraphrasing is a valuable learning signal. While the general trend in the field is to create ever larger datasets to train models on, our findings…
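The evaluation idea in the abstract (comparing embedding similarity against human similarity judgements) can be illustrated with a minimal sketch. The abstract does not name the similarity measure or the correlation statistic, so cosine similarity and Pearson's r are assumptions here, and the function names and toy vectors are hypothetical rather than taken from the authors' repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def evaluate(embedding_pairs, human_scores):
    """Correlate model similarity scores with human similarity ratings.

    embedding_pairs: list of (embedding_a, embedding_b), one per sentence pair
    human_scores: one human similarity rating per sentence pair
    """
    model_scores = [cosine_similarity(a, b) for a, b in embedding_pairs]
    return pearson(model_scores, human_scores)


# Toy data, invented for illustration only (not results from the paper):
# three sentence pairs with 2-d "embeddings" and made-up human ratings.
toy_pairs = [
    ([1.0, 0.0], [0.9, 0.1]),  # near-paraphrases
    ([1.0, 0.0], [0.5, 0.5]),  # partially related
    ([1.0, 0.0], [0.0, 1.0]),  # unrelated
]
toy_human = [4.8, 2.5, 0.2]

if __name__ == "__main__":
    print(f"correlation with human ratings: {evaluate(toy_pairs, toy_human):.3f}")
```

In practice the embeddings would come from the trained speech encoder applied to the spoken sentence pairs, and a higher correlation would indicate that the embedding space tracks human similarity judgements more closely.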

Code & Models

Repositories

DannyMerkx/speech2image
PyTorch · Official

Taxonomy

Methods: Adam · Cyclical Learning Rate Policy · Sigmoid Activation · Tanh Activation · Long Short-Term Memory