L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi
Ananya Joshi, Aditi Kajale, Janhavi Gadre, Samruddhi Deode, Raviraj Joshi

TL;DR
This paper develops and benchmarks sentence-BERT models for Hindi and Marathi, trained on synthetic NLI and STS datasets produced by machine translation, and shows that they outperform existing multilingual models on tasks in these low-resource languages.
Contribution
It introduces L3Cube-MahaSBERT and HindSBERT, the first high-performance sentence-BERT models for Marathi and Hindi, trained with synthetic datasets, and provides a comprehensive benchmarking analysis.
Findings
Synthetic data training yields high-quality embeddings for low-resource languages.
The proposed models outperform multilingual models such as LaBSE, which is trained with a more complex strategy.
Embeddings generalize well to real downstream tasks.
Abstract
Sentence representation from vanilla BERT models does not work well on sentence similarity tasks. Sentence-BERT models specifically trained on STS or NLI datasets are shown to provide state-of-the-art performance. However, building these models for low-resource languages is not straightforward due to the lack of these specialized datasets. This work focuses on two low-resource Indian languages, Hindi and Marathi. We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared using machine translation. We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi. The vanilla BERT models trained using this simple strategy outperform the multilingual LaBSE trained using a complex training strategy. These models are evaluated on downstream text…
Code & Models
- 🤗 l3cube-pune/marathi-sentence-similarity-sbert · model · 162 dl · ♡ 3
- 🤗 l3cube-pune/hindi-sentence-similarity-sbert · model · 2.5k dl · ♡ 7
- 🤗 l3cube-pune/hindi-sentence-bert-nli · model · 33 dl · ♡ 2
- 🤗 l3cube-pune/marathi-sentence-bert-nli · model · 95 dl · ♡ 1
- 🤗 l3cube-pune/bengali-sentence-similarity-sbert · model · 1.6k dl · ♡ 6
- 🤗 l3cube-pune/gujarati-sentence-similarity-sbert · model · 7 dl · ♡ 1
- 🤗 l3cube-pune/tamil-sentence-similarity-sbert · model · 79 dl · ♡ 3
- 🤗 l3cube-pune/telugu-sentence-similarity-sbert · model · 56 dl · ♡ 1
- 🤗 l3cube-pune/odia-sentence-similarity-sbert · model · 4 dl · ♡ 2
- 🤗 l3cube-pune/kannada-sentence-similarity-sbert · model · 16 dl · ♡ 2
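A minimal sketch of how the models listed above might be used for sentence similarity. It assumes they load through the `sentence-transformers` library's `SentenceTransformer` class, which this page does not confirm. The helper names `embed`, `cosine_similarity`, and `rank_by_similarity` are hypothetical, and the library is imported lazily so the scoring helpers work on any list of embedding vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return (candidate_index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, c)) for i, c in enumerate(candidate_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def embed(sentences, model_id="l3cube-pune/marathi-sentence-similarity-sbert"):
    """Encode sentences into embedding vectors.

    Assumption: the model ID is loadable via sentence-transformers.
    Needs `pip install sentence-transformers` and network access on first use.
    """
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_id)
    return model.encode(sentences).tolist()
```

With the library installed, `vectors = embed([query, *candidates])` followed by `rank_by_similarity(vectors[0], vectors[1:])` would order the candidates by similarity to the query. Passing one of the Hindi model IDs as `model_id` should work the same way under the same assumption.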
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Adam · Linear Layer · Dense Connections · Residual Connection · Attention Dropout · Dropout
