Open Sentence Embeddings for Portuguese with the Serafim PT* encoders family
Luís Gomes, António Branco, João Silva, João Rodrigues, Rodrigo Santos

TL;DR
This paper introduces Serafim PT*, a family of open-source Portuguese sentence encoders that achieve state-of-the-art performance and are adaptable to different hardware, with a systematic study on their design choices.
Contribution
It presents a new family of Portuguese sentence encoders with diverse sizes and a comprehensive analysis of their training objectives and parameters.
Findings
State-of-the-art performance across models
Open-source availability for research and commercial use
Insights on learning objectives and parameter selection
Abstract
Sentence encoders encode the semantics of their input, enabling key downstream applications such as classification, clustering, or retrieval. In this paper, we present Serafim PT*, a family of open-source sentence encoders for Portuguese with various sizes, suited to different hardware/compute budgets. Each model exhibits state-of-the-art performance and is made openly available under a permissive license, allowing its use for both commercial and research purposes. Besides the sentence encoders, this paper contributes a systematic study and lessons learned concerning the selection criteria of learning objectives and parameters that support top-performing encoders.
Code & Models
- 🤗 PORTULAN/serafim-900m-portuguese-pt-sentence-encoder · model · 2.6k dl · ♡ 1
- 🤗 PORTULAN/serafim-335m-portuguese-pt-sentence-encoder · model · 1.2k dl
- 🤗 PORTULAN/serafim-100m-portuguese-pt-sentence-encoder · model · 717 dl · ♡ 1
- 🤗 PORTULAN/serafim-900m-portuguese-pt-sentence-encoder-ir · model · 88k dl · ♡ 5
- 🤗 PORTULAN/serafim-100m-portuguese-pt-sentence-encoder-ir · model · 7.9k dl · ♡ 1
- 🤗 PORTULAN/serafim-335m-portuguese-pt-sentence-encoder-ir · model · 196k dl
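As a rough illustration of the retrieval use case named in the abstract, the sketch below ranks candidate sentences by cosine similarity to a query embedding. The ranking helpers are plain Python. The `embed` function assumes the checkpoints load through the third-party sentence-transformers library, which this page does not confirm, and the function names and example sentences are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def embed(sentences, model_id="PORTULAN/serafim-100m-portuguese-pt-sentence-encoder"):
    """Encode sentences into vectors with one of the listed checkpoints.

    Assumption: the checkpoint is loadable with sentence-transformers.
    The import is kept local so the ranking helpers work without it.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return model.encode(sentences).tolist()


def demo():
    """Hypothetical end-to-end usage; downloads the model on first call."""
    query = "Um felino está a descansar."
    docs = ["O gato dorme no sofá.", "A bolsa de valores caiu hoje."]
    vecs = embed([query] + docs)
    for i in rank_by_similarity(vecs[0], vecs[1:]):
        print(docs[i])
```

Calling `demo()` should print the candidate sentences in descending order of similarity to the query. Swapping `model_id` for a larger checkpoint from the list trades compute for the quality gains the paper reports.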
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Hate Speech and Cyberbullying Detection
