Automated Detection of Non-Relevant Posts on the Russian Imageboard "2ch": Importance of the Choice of Word Representations

Amir Bakarov; Olga Gureenkova

arXiv:1707.04860·cs.CL·January 23, 2018

TL;DR

This paper investigates how different word embedding models affect the automated detection of non-relevant posts on Russian forums by comparing their performance on a specially created dataset.

Contribution

It introduces a comparative analysis of seven word embedding models for semantic relatedness detection in Russian forum posts, highlighting the importance of embedding choice.

Findings

1. FastText and Swivel outperform the other models in relatedness detection
2. The dataset reveals challenges due to Russian lexical and grammatical features
3. Word embedding choice significantly impacts detection accuracy

Abstract

This study considers the problem of automated detection of non-relevant posts on Web forums and discusses an approach that approximates this problem with the task of detecting semantic relatedness between a given post and the opening post of the forum discussion thread. The approximated task can be solved by training a supervised classifier on composed word embeddings of the two posts. Since success in this task may be quite sensitive to the choice of word representations, we compare the performance of different word embedding models. We train 7 models (Word2Vec, GloVe, Word2Vec-f, Wang2Vec, AdaGram, FastText, Swivel), evaluate the embeddings they produce on a dataset of human judgements, and compare their performance on the task of non-relevant post detection. To make the comparison, we propose a dataset of semantic…
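The pipeline described in the abstract (embed two posts, compose the embeddings, train a supervised classifier) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy 3-dimensional vectors in EMBEDDINGS, averaging word vectors to get a post vector, composing the pair as absolute difference plus element-wise product, and using a hand-written logistic regression as the classifier. The abstract does not say which composition or classifier the authors used.

```python
import math

# Toy 3-dimensional embeddings (assumption). A real setup would load vectors
# produced by one of the trained models, e.g. FastText or Swivel.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "pet": [0.7, 0.3, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
    "price": [0.0, 0.2, 0.7],
}
DIM = 3


def post_vector(tokens):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return [0.0] * DIM
    return [sum(v[i] for v in known) / len(known) for i in range(DIM)]


def compose(opening_post, reply):
    """Hypothetical composition: |a - b| concatenated with a * b."""
    a, b = post_vector(opening_post), post_vector(reply)
    diff = [abs(x - y) for x, y in zip(a, b)]
    prod = [x * y for x, y in zip(a, b)]
    return diff + prod


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(features, labels, epochs=2000, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    """Return 1 (relevant) or 0 (non-relevant)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


# Made-up training pairs: (opening post, reply, label), 1 = relevant.
TRAIN = [
    (["cat", "pet"], ["kitten"], 1),
    (["stock", "market"], ["price"], 1),
    (["cat", "kitten"], ["market", "price"], 0),
    (["stock", "price"], ["pet", "cat"], 0),
]

X = [compose(op, reply) for op, reply, _ in TRAIN]
y = [label for _, _, label in TRAIN]
w, b = train_logreg(X, y)
train_predictions = [predict(w, b, x) for x in X]
```

In this setup, swapping the embedding model only changes the contents of EMBEDDINGS while the rest of the pipeline stays fixed, which is the kind of controlled comparison the abstract describes.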

Code & Models

Repositories

bakarov/2ch2vec (official)