LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation

Zhuoyuan Mao; Tetsuji Nakagawa

arXiv:2302.08387 · cs.CL · December 27, 2023

TL;DR

This paper introduces LEALLA, a lightweight, language-agnostic sentence embedding model that uses knowledge distillation to achieve competitive performance across 109 languages while reducing inference overhead.

Contribution

The paper proposes a lightweight thin-deep encoder and distillation methods for efficient multilingual sentence embeddings, reducing inference time and computation overhead relative to large-scale models such as LaBSE.

Findings

01

LEALLA achieves strong performance on the Tatoeba, United Nations, and BUCC benchmarks.

02

The lightweight models reduce inference time and computation overhead compared with large-scale models such as LaBSE.

03

Knowledge distillation from a teacher model further improves embedding quality.

Abstract

Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large-scale models can suffer from inference speed and computation overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models LEALLA on TensorFlow Hub.
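
The parallel sentence alignment task mentioned in the abstract is commonly evaluated by nearest-neighbor retrieval over sentence embeddings: each source sentence is matched to the target sentence whose embedding is most similar. The sketch below illustrates that idea only. It is not the authors' evaluation code, the helper names `cosine` and `align` are hypothetical, and the toy 3-dimensional vectors stand in for real model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align(source_embs, target_embs):
    """For each source embedding, return the index of the most similar target."""
    return [
        max(range(len(target_embs)), key=lambda j: cosine(s, target_embs[j]))
        for s in source_embs
    ]


# Toy vectors (assumption): real embeddings would come from a sentence encoder.
source = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
target = [[0.1, 0.9, 0.3], [0.9, 0.2, 0.1]]

matches = align(source, target)
print(matches)  # [1, 0]
```

Under this setup, lower-dimensional embeddings make each similarity computation cheaper, which is one reason a compact model can be attractive for large-scale alignment.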

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
