StyloMetrix: An Open-Source Multilingual Tool for Representing Stylometric Vectors
Inez Okulska, Daria Stetsenko, Anna Kołos, Agnieszka Karlińska, Kinga Głąbińska, Adam Nowakowski

TL;DR
StyloMetrix is an open-source multilingual tool that generates stylometric vectors from text, enhancing machine learning and deep learning models for classification tasks across four languages.
Contribution
It introduces a comprehensive, multilingual stylometric feature extraction tool that improves text classification performance in machine learning and deep learning models.
Findings
Effective in supervised content classification with simple algorithms
Enhances embedding layers in Transformer-based models
Proven usefulness across four languages
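The first finding (stylometric vectors feeding a simple supervised classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_style_vector` computes three made-up features normalized by token count, not actual StyloMetrix metrics, and a nearest-centroid rule stands in for the Random Forest, Voting Classifier, and Logistic Regression models named in the abstract. The example texts and labels are invented.

```python
import string
from collections import defaultdict
from math import dist


def toy_style_vector(text):
    """Return three token-count-normalized style features.

    Hypothetical stand-in for a stylometric vector; these are NOT
    the features StyloMetrix computes.
    """
    tokens = text.split()
    n = len(tokens) or 1  # avoid division by zero on empty input
    words = [t.strip(string.punctuation).lower() for t in tokens]
    words = [w for w in words if w]
    return [
        sum(t[-1] in string.punctuation for t in tokens) / n,  # tokens ending in punctuation
        len(set(words)) / n,                                   # type-token ratio
        sum(len(w) > 6 for w in words) / n,                    # share of long words
    ]


def fit_centroids(vectors, labels):
    """Average the feature vectors of each class (nearest-centroid training)."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


# Invented training data: two style classes, two texts each.
train_texts = [
    "Wow! No way! Really? Yes!",
    "Stop. Look. Listen. Go!",
    "The committee subsequently published comprehensive recommendations regarding infrastructure",
    "Researchers systematically evaluated multilingual classification performance across datasets",
]
train_labels = ["terse", "terse", "formal", "formal"]

centroids = fit_centroids([toy_style_vector(t) for t in train_texts], train_labels)
prediction = predict(centroids, toy_style_vector("Hey! Wait! What? Oh!"))
print(prediction)
```

Because each feature is a ratio between 0 and 1, texts of different lengths stay comparable, which is what lets a simple distance-based or tree-based classifier work on such vectors directly.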
Abstract
This work provides an overview of the open-source multilingual tool StyloMetrix. It offers stylometric text representations that cover various aspects of grammar, syntax, and lexicon. StyloMetrix covers four languages: Polish as the primary language, English, Ukrainian, and Russian. The normalized output of each feature can become a fruitful source for machine learning models and a valuable addition to the embedding layer of any deep learning algorithm. We strive to provide a concise but exhaustive overview of the applications of StyloMetrix vectors and to explain the sets of linguistic features developed. The experiments have shown promising results in supervised content classification with simple algorithms such as Random Forest Classifier, Voting Classifier, Logistic Regression, and others. The deep learning assessments have unveiled the usefulness of the…
Taxonomy
Topics: Authorship Attribution and Profiling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Layer Normalization · Label Smoothing · Byte Pair Encoding · Dropout · Softmax
