A Comparative Study of Text Retrieval Models on DaReCzech

Jakub Stetina; Martin Fajcik; Michal Stefanik; Michal Hradis

arXiv:2411.12921 · cs.IR · December 24, 2024

TL;DR

This study evaluates seven document retrieval models on the Czech dataset DaReCzech, comparing their accuracy, speed, and memory usage, and analyzing the impact of language processing choices.

Contribution

It provides a comprehensive comparison of modern retrieval models on Czech data and insights into the best approaches for Czech language retrieval tasks.

Findings

01

Gemma2 achieved the highest precision and recall.

02

Contriever performed poorly relative to the other models.

03

SPLADE and PLAID offered a good balance of efficiency and performance.

Abstract

This article presents a comprehensive evaluation of seven off-the-shelf document retrieval models (SPLADE, PLAID, PLAID-X, SimCSE, Contriever, OpenAI ADA, and Gemma2) on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language. Our analyses cover retrieval quality, speed, and memory footprint. We also analyze whether it is better to apply a model directly to Czech text or to machine-translate the text into English and then retrieve in English. Our experiments identify the most effective option for Czech information retrieval. The findings reveal notable performance differences among the models, with Gemma2 achieving the highest precision and recall, while Contriever performed poorly. Overall, the SPLADE and PLAID models offered a balance…
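The abstract ranks the models by precision and recall. As a reading aid, the following is a minimal sketch of how these two metrics are commonly computed at a rank cutoff k for a single query. It is not taken from the paper: the function names, the document IDs, and the choice of k are illustrative assumptions, and the abstract does not state which cutoff the authors used.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        # No relevant documents for this query; recall is defined as 0 here
        # (a convention chosen for this sketch).
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    # Hypothetical ranking returned by some retrieval model for one query.
    ranked = ["d3", "d1", "d7", "d2", "d5"]
    # Hypothetical set of documents judged relevant for that query.
    relevant = {"d1", "d2", "d9"}

    for k in (3, 5):
        p = precision_at_k(ranked, relevant, k)
        r = recall_at_k(ranked, relevant, k)
        print(f"k={k}: precision={p:.3f} recall={r:.3f}")
```

In an evaluation over a whole dataset, per-query values like these would typically be averaged across all queries before models are compared.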

Taxonomy

Topics: Recommender Systems and Techniques

Methods: SimCSE · OpenAI ADA