Some Like It Small: Czech Semantic Embedding Models for Industry Applications

Jiří Bednář; Jakub Náplava; Petra Barančíková; Ondřej Lisický

arXiv:2311.13921·cs.CL·November 27, 2023·1 cites

Open Access · 1 Repo · 3 Models

TL;DR

This paper develops small Czech sentence embedding models optimized for industry use, demonstrating their efficiency and effectiveness in real-world search applications with significant size and speed advantages.

Contribution

It introduces and evaluates compact Czech sentence embedding models trained with pre-training, knowledge distillation, and unsupervised contrastive fine-tuning, achieving performance competitive with significantly larger models.

Findings

01. The models are approximately 8 times smaller and 5 times faster than conventional Base-sized models.

02. The models outperform previous versions in search-related tasks.

03. The models and the evaluation pipeline are publicly released to promote reproducibility.

Abstract

This article focuses on the development and evaluation of Small-sized Czech sentence embedding models. Small models are important components for real-time industry applications in resource-constrained environments. Given the limited availability of labeled Czech data, alternative approaches, including pre-training, knowledge distillation, and unsupervised contrastive fine-tuning, are investigated. Comprehensive intrinsic and extrinsic analyses are conducted, showcasing the competitive performance of our models compared to significantly larger counterparts, with approximately 8 times smaller size and 5 times faster speed than conventional Base-sized models. To promote cooperation and reproducibility, both the models and the evaluation pipeline are made publicly accessible. Ultimately, this article presents practical applications of the developed sentence embedding models in Seznam.cz,…
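
The abstract names unsupervised contrastive fine-tuning as one of the investigated approaches but does not spell out the objective here. As a rough illustration only, the following sketch assumes an InfoNCE-style loss with in-batch negatives, which is a common choice for this kind of training. The function names, the temperature value, and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (assumed objective).

    anchors[i] and positives[i] are two embeddings of the same sentence
    (for example, two dropout-perturbed encoder passes); every
    positives[j] with j != i serves as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # Negative log-probability of picking the matching positive.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings (hypothetical values, not from the paper).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive matches its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives paired with the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
```

Under these assumptions, the loss is small when each sentence's two embeddings are closer to each other than to the other sentences in the batch, and large when the pairing is swapped, which is the signal such fine-tuning relies on in the absence of labeled data.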

Code & Models

Repositories

seznam/czech-semantic-embedding-models
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling
