L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT

Samruddhi Deode; Janhavi Gadre; Aditi Kajale; Ananya Joshi; Raviraj Joshi

arXiv:2304.11434 · cs.CL · April 25, 2023 · 5 cites

Open Access 10 Models

TL;DR

This paper introduces L3Cube-IndicSBERT, a simple method to convert multilingual BERT into effective cross-lingual sentence representations, especially for Indian languages, outperforming existing models on similarity tasks.

Contribution

Proposes a straightforward fine-tuning approach using synthetic datasets to create high-quality multilingual sentence embeddings for Indic languages and beyond.

Findings

1. IndicSBERT outperforms LaBSE, LASER, and MPNet on Indic language similarity tasks.
2. The approach works effectively for non-Indic languages like German and French.
3. Monolingual SBERT models perform competitively with IndicSBERT.

Abstract

Multilingual Sentence-BERT (SBERT) models map different languages to a common representation space and are useful for cross-language similarity and mining tasks. We propose a simple yet effective approach to convert vanilla multilingual BERT models into multilingual sentence BERT models using a synthetic corpus. We simply aggregate translated NLI or STS datasets of the low-resource target languages together and perform SBERT-like fine-tuning of the vanilla multilingual BERT model. We show that multilingual BERT models are inherent cross-lingual learners and that this simple baseline fine-tuning approach, without explicit cross-lingual training, yields exceptional cross-lingual properties. We show the efficacy of our approach on 10 major Indic languages and also show its applicability to the non-Indic languages German and French. Using this approach, we further present…
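
The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `aggregate_translated_nli`, the `(premise, hypothesis, label)` tuple layout, and the placeholder sentences are all assumptions made for illustration. It also assumes that each pair stays within a single language, which is one reading of "without explicit cross-lingual training". The pooled list would then be fed to an SBERT-like fine-tuning routine, which is not shown here.

```python
import random
from typing import Dict, List, Tuple

# One NLI training example: (premise, hypothesis, label).
# This layout is an assumption for the sketch, not taken from the paper.
Example = Tuple[str, str, str]


def aggregate_translated_nli(
    per_language: Dict[str, List[Example]], seed: int = 0
) -> List[Example]:
    """Pool translated NLI examples from every language into one shuffled list.

    Each example is kept as-is, so premise and hypothesis share a language;
    no explicit cross-lingual pairs are constructed.
    """
    pooled: List[Example] = []
    for lang in sorted(per_language):  # sorted keys give a reproducible order
        pooled.extend(per_language[lang])
    random.Random(seed).shuffle(pooled)  # seeded shuffle mixes the languages
    return pooled


# Placeholder strings standing in for translated sentences (hypothetical data).
toy = {
    "hi": [
        ("hi premise 1", "hi hypothesis 1", "entailment"),
        ("hi premise 2", "hi hypothesis 2", "contradiction"),
    ],
    "mr": [("mr premise 1", "mr hypothesis 1", "neutral")],
    "de": [("de premise 1", "de hypothesis 1", "entailment")],
}

pooled = aggregate_translated_nli(toy, seed=0)
print(len(pooled))
```

Shuffling the pooled list means every training batch is likely to mix several languages, which is the only sense in which this sketch exposes the model to more than one language at a time.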

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Attention Dropout · WordPiece · Dense Connections · Dropout · Weight Decay