Including Dialects and Language Varieties in Author Profiling

Alina Maria Ciobanu; Marcos Zampieri; Shervin Malmasi; Liviu P. Dinu

arXiv:1707.00621 · cs.CL · July 4, 2017 · 5 cites

Open Access

TL;DR

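This paper introduces an ensemble of linear SVM classifiers for author profiling that predicts gender and language variety. On the PAN 2017 Twitter dataset it reaches 75% average accuracy on gender identification across four languages and 97% accuracy on Portuguese language variety identification.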

Contribution

The paper presents an ensemble approach that combines linear SVM classifiers trained on character and word n-grams to identify gender and language variety in social media texts.
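
To make the two feature families concrete, here is a minimal sketch of character and word n-gram extraction. The function names (`char_ngrams`, `word_ngrams`, `ngram_counts`), the whitespace tokenisation, the lowercasing of words, and the default n-gram orders are all illustrative assumptions; this summary does not state which orders or preprocessing the authors used.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping word n-grams after naive whitespace tokenisation."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text: str, char_n: int = 3, word_n: int = 1) -> dict[str, Counter]:
    """Count each feature family separately so each can feed its own classifier."""
    return {
        "char": Counter(char_ngrams(text, char_n)),
        "word": Counter(word_ngrams(text, word_n)),
    }


features = ngram_counts("bom dia bom")
print(features["char"].most_common(1))  # [('bom', 2)]
print(features["word"])                 # Counter({'bom': 2, 'dia': 1})
```

Keeping the character and word counts in separate vectors is what allows one classifier per feature type, which is the precondition for combining their outputs in an ensemble.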

Findings

1. 75% average accuracy on gender identification for tweets written in four languages

2. 97% accuracy on language variety identification for Portuguese

3. An ensemble of linear SVMs is effective on the multilingual PAN 2017 author profiling dataset

Abstract

This paper presents a computational approach to author profiling taking gender and language variety into account. We apply an ensemble system with the output of multiple linear SVM classifiers trained on character and word $n$-grams. We evaluate the system using the dataset provided by the organizers of the 2017 PAN lab on author profiling. Our approach achieved 75% average accuracy on gender identification on tweets written in four languages and 97% accuracy on language variety identification for Portuguese.
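
The general shape of such a system can be sketched in plain Python: one linear SVM per n-gram feature type, with the per-classifier outputs fused into a single decision. This is a hedged illustration, not the authors' implementation. The abstract does not specify the fusion rule, the n-gram orders, or the SVM hyperparameters, so the mean-of-decision-values fusion, the Pegasos-style trainer, the binary +1/-1 labels, the class name `SvmEnsemble`, and the toy sentences below are all assumptions made for the example.

```python
import random
from collections import Counter


def featurize(text, kind, n):
    """Bag of character or word n-grams as a sparse count vector."""
    units = text.lower() if kind == "char" else text.lower().split()
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))


def score(weights, vector):
    """Linear decision value w . x for sparse dict vectors."""
    return sum(weights.get(feat, 0.0) * count for feat, count in vector.items())


def train_linear_svm(vectors, labels, lam=0.01, epochs=30, seed=0):
    """Pegasos-style stochastic subgradient descent on the L2-regularised
    hinge loss. Labels are +1 / -1; no bias term, for brevity."""
    rng = random.Random(seed)
    weights, step = {}, 0
    order = list(range(len(vectors)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            step += 1
            eta = 1.0 / (lam * step)
            margin = labels[i] * score(weights, vectors[i])
            for feat in weights:  # shrink step from the L2 regulariser
                weights[feat] *= 1.0 - eta * lam
            if margin < 1.0:      # hinge loss is active: move towards the example
                for feat, count in vectors[i].items():
                    weights[feat] = weights.get(feat, 0.0) + eta * labels[i] * count
    return weights


class SvmEnsemble:
    """One linear SVM per feature view; fusion by mean decision value (assumed)."""

    def __init__(self, views):
        self.views = views  # list of (kind, n) pairs, e.g. ("char", 3)
        self.models = []

    def fit(self, texts, labels):
        self.models = [
            train_linear_svm([featurize(t, kind, n) for t in texts], labels)
            for kind, n in self.views
        ]
        return self

    def decision(self, text):
        scores = [
            score(weights, featurize(text, kind, n))
            for weights, (kind, n) in zip(self.models, self.views)
        ]
        return sum(scores) / len(scores)

    def predict(self, text):
        return 1 if self.decision(text) >= 0 else -1


# Invented toy sentences, for illustration only: +1 = pt-BR, -1 = pt-PT.
texts = [
    "peguei ônibus lotado", "meu celular quebrou", "estou fazendo café",
    "apanhei autocarro cheio", "o telemóvel avariou", "estamos a fazer chá",
]
labels = [1, 1, 1, -1, -1, -1]
views = [("char", 2), ("char", 3), ("word", 1)]

ensemble = SvmEnsemble(views).fit(texts, labels)
print(ensemble.predict("o autocarro avariou"))
```

Averaging raw decision values is only one possible fusion rule; majority voting or averaging calibrated probabilities would fit the same structure. A real system would more likely rely on a library SVM implementation and would treat multi-class language variety labels with a one-vs-rest scheme.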

Taxonomy

Topics: Authorship Attribution and Profiling · Swearing, Euphemism, Multilingualism · Names, Identity, and Discrimination Research

Methods: Support Vector Machine