SEA-BED: How Do Embedding Models Represent Southeast Asian Languages?
Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Raymond Ng, Jann Railey Montalan, Thura Aung, Jian Gang Ngui, Yosephine Susanto, William Chandra Tjhi, Panuthep Tasawong, Erik Cambria, Ekapol Chuangsuwanich, Sarana Nutanong

TL;DR
This paper introduces SEA-BED, a large-scale benchmark covering 10 Southeast Asian languages. It shows that multilingual embedding models perform inconsistently across languages and tasks, which challenges the assumption that embeddings provide stable semantic representations.
Contribution
The paper presents SEA-BED, a large-scale benchmark for SEA languages, and systematically evaluates embedding models, highlighting their uneven performance across languages and tasks.
Findings
- No single model performs well across all SEA languages.
- Task difficulty varies significantly within languages.
- Performance varies across language-task combinations.
Abstract
Multilingual text embeddings are often assumed to encode meaning in a perspective-independent semantic space, yielding stable similarity judgments across tasks and languages. Our results show that this assumption does not hold in practice. We introduce SEA-BED, a large-scale benchmark covering 10 Southeast Asian (SEA) languages and diverse embedding tasks, designed to systematically examine how embedding performance varies across tasks, languages, and language-task combinations. Across extensive evaluations, we observe that no single model performs uniformly well across SEA languages; task difficulty differs markedly within languages, and success on one task does not reliably generalize to others. Language-task analyses further reveal highly non-uniform performance landscapes, where performance varies across different language-task combinations. These findings call for closer attention…
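The language-task analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the model names, the language codes (`tha`, `ind`), the task names, and every score below are hypothetical placeholders chosen only to show the shape of the comparison, and the helper functions (`mean_by`, `best_model_per_cell`, `spread`) are assumed names introduced here.

```python
from statistics import mean, pstdev

# Hypothetical placeholder scores keyed by (language, task).
# These numbers are NOT results from the paper; they only illustrate the analysis.
scores = {
    "model_a": {
        ("tha", "retrieval"): 0.70, ("tha", "classification"): 0.50,
        ("ind", "retrieval"): 0.60, ("ind", "classification"): 0.80,
    },
    "model_b": {
        ("tha", "retrieval"): 0.65, ("tha", "classification"): 0.75,
        ("ind", "retrieval"): 0.55, ("ind", "classification"): 0.60,
    },
}


def mean_by(model_scores, axis):
    """Average one model's scores per language (axis=0) or per task (axis=1)."""
    groups = {}
    for cell, value in model_scores.items():
        groups.setdefault(cell[axis], []).append(value)
    return {name: mean(values) for name, values in groups.items()}


def best_model_per_cell(all_scores):
    """Return the top-scoring model for every (language, task) cell."""
    cells = set().union(*(s.keys() for s in all_scores.values()))
    return {
        cell: max(all_scores, key=lambda m: all_scores[m].get(cell, float("-inf")))
        for cell in sorted(cells)
    }


def spread(model_scores):
    """Population standard deviation of one model's scores across all cells."""
    return pstdev(model_scores.values())


per_language = mean_by(scores["model_a"], axis=0)
per_task = mean_by(scores["model_a"], axis=1)
winners = best_model_per_cell(scores)

if __name__ == "__main__":
    print("model_a mean per language:", per_language)
    print("model_a mean per task:", per_task)
    print("best model per cell:", winners)
    print("model_a spread across cells:", round(spread(scores["model_a"]), 3))
```

In this toy setup the winning model changes from cell to cell, which is the kind of non-uniform landscape the abstract describes. A single averaged score per model would hide that variation, so the sketch keeps the per-cell view alongside the per-language and per-task means.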
Code & Models
- 🤗 aisingapore/SEA-LION-ModernBERT-300M-checkpoints (model)
- 🤗 aisingapore/SEA-LION-ModernBERT-600M-checkpoints (model)
- 🤗 aisingapore/SEA-LION-ModernBERT-Embedding-300M (model) · 130 downloads · ♡ 1
- 🤗 aisingapore/SEA-LION-ModernBERT-Embedding-600M (model) · 130 downloads
- 🤗 aisingapore/SEA-LION-E5-Embedding-600M (model) · 1.3k downloads · ♡ 2
- 🤗 aisingapore/SEA-LION-ModernBERT-300M (model) · 196 downloads
- 🤗 aisingapore/SEA-LION-ModernBERT-600M (model) · 136 downloads
- 🤗 minhnguyent546/SEA-LION-ModernBERT-Embedding-300M (model) · 5 downloads
- 🤗 minhnguyent546/SEA-LION-E5-Embedding-600M (model) · 5 downloads
- 🤗 minhnguyent546/SEA-LION-ModernBERT-Embedding-600M (model) · 5 downloads
