L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi

Ananya Joshi; Aditi Kajale; Janhavi Gadre; Samruddhi Deode; Raviraj Joshi

arXiv:2211.11187 · cs.CL · November 23, 2022 · 1 citation

TL;DR

This paper develops and benchmarks sentence-BERT models for Hindi and Marathi using synthetic datasets, demonstrating their effectiveness over existing multilingual models in low-resource language tasks.

Contribution

It introduces L3Cube-MahaSBERT and HindSBERT, the first high-performance sentence-BERT models for Marathi and Hindi, trained with synthetic datasets, and provides a comprehensive benchmarking analysis.

Findings

01. Synthetic data training yields high-quality embeddings for low-resource languages.

02. The proposed models outperform complex multilingual models like LaBSE.

03. Embeddings generalize well to real downstream tasks.

Abstract

Sentence representation from vanilla BERT models does not work well on sentence similarity tasks. Sentence-BERT models specifically trained on STS or NLI datasets are shown to provide state-of-the-art performance. However, building these models for low-resource languages is not straightforward due to the lack of these specialized datasets. This work focuses on two low-resource Indian languages, Hindi and Marathi. We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared using machine translation. We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi. The vanilla BERT models trained using this simple strategy outperform the multilingual LaBSE trained using a complex training strategy. These models are evaluated on downstream text…
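
As an illustration of the sentence-similarity setting the abstract describes, the sketch below shows one common way to turn per-token BERT vectors into a single sentence vector (mean pooling over non-padding tokens) and to score two sentences with cosine similarity. The pooling choice and the function names `mean_pool` and `cosine_similarity` are assumptions made for illustration, not details taken from the paper, and the toy vectors stand in for real model outputs.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped).

    Assumption: mean pooling is used here for illustration only; the
    abstract does not state which pooling the authors apply.
    """
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two sentence vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy token vectors standing in for BERT outputs; the last token of
# sentence A is padding and is masked out.
sentence_a = mean_pool([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], [1, 1, 0])
sentence_b = mean_pool([[2.0, 2.0]], [1])
similarity = cosine_similarity(sentence_a, sentence_b)
```

In an STS-style evaluation, scores like `similarity` would be compared against human similarity ratings across many sentence pairs.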

Code & Models

Repositories

l3cube-pune/MarathiNLP (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Adam · Linear Layer · Dense Connections · Residual Connection · Attention Dropout · Dropout