FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts
Seyed Mojtaba Sadjadi, Zeinab Rajabi, Leila Rabiei, Mohammad-Shahram Moin

TL;DR
FarSSiBERT is a transformer-based model trained on a large Persian social media dataset, significantly improving semantic similarity measurement for informal texts and providing a new dataset and tokenizer for Persian NLP.
Contribution
This paper introduces FarSSiBERT, a novel Persian transformer model trained on 104 million social media texts, along with a new annotated dataset and specialized tokenizer for informal language.
Findings
FarSSiBERT outperforms ParsBERT, LaBSE, and multilingual BERT in similarity tasks.
The model effectively handles colloquial Persian texts from social networks.
The dataset FarSSiM and the tokenizer improve NLP tasks on informal Persian texts.
Abstract
One fundamental task for NLP is to determine the similarity between two texts and evaluate the extent of their likeness. The previous methods for the Persian language have low accuracy and are unable to comprehend the structure and meaning of texts effectively. Additionally, these methods primarily focus on formal texts, but in real-world applications of text processing, there is a need for robust methods that can handle colloquial texts. This requires algorithms that consider the structure and significance of words based on context, rather than just the frequency of words. The lack of a proper dataset for this task in the Persian language makes it important to develop such algorithms and construct a dataset for Persian text. This paper introduces a new transformer-based model to measure semantic similarity between Persian informal short texts from social networks. In addition, a…
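The contrast the abstract draws between word-frequency methods and context-aware representations can be illustrated with a minimal sketch. This is not the paper's implementation: the function names are hypothetical, and the token vectors passed to `embedding_similarity` are assumed to come from some transformer encoder (they are hand-made placeholders in the usage note). Mean pooling followed by cosine similarity is one common way to compare two short texts, used here only as an assumption.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty or all-zero vector has no direction
    return dot / (norm_u * norm_v)


def bag_of_words_similarity(text_a, text_b):
    """Frequency-based baseline: compares raw word counts only."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    vocab = sorted(set(counts_a) | set(counts_b))
    return cosine([counts_a[w] for w in vocab], [counts_b[w] for w in vocab])


def mean_pool(token_vectors):
    """Average per-token vectors into a single text-level vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def embedding_similarity(tokens_a, tokens_b):
    """Compares two texts through pooled contextual token vectors.

    tokens_a and tokens_b are lists of per-token vectors, assumed to be
    produced by an encoder; this sketch does not compute them.
    """
    return cosine(mean_pool(tokens_a), mean_pool(tokens_b))
```

With the frequency baseline, two texts that share no surface words always score 0.0, even if they mean the same thing, as is common with informal spelling variants. With pooled contextual vectors, such texts can still score close to 1.0 if the encoder places them near each other.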
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Sentiment Analysis and Opinion Mining
Methods: Attention Is All You Need · Dropout · WordPiece · Weight Decay · Attention Dropout · Residual Connection · Adam · Linear Layer · Layer Normalization
