# Exploiting user-frequency information for mining regionalisms from Social Media texts

**Authors:** Juan Manuel Pérez, Damián E. Aleman, Santiago N. Kalinowski, Agustín Gravano

arXiv: 1907.04492 · 2019-07-11

## TL;DR

This paper introduces an information-theoretic metric that incorporates user frequency for the automatic detection of regionalisms in social media texts. On a corpus of Argentinian Spanish tweets, it outperforms techniques based solely on word frequency.

## Contribution

The paper presents a novel metric based on Information Theory that takes into account how many users produce a word, not only how often the word occurs, and applies it to regionalism detection and lexicographic discovery in social media data.

## Key findings

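- The new metric outperforms techniques based solely on word frequency, both in manual annotation of the relevance of retrieved terms and as a feature selection method for user geolocation.
- It has helped lexicographers discover several unregistered words of Argentinian Spanish, as well as different meanings assigned to registered words.
- The results suggest that the number of users who produce a word is an informative signal.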

## Abstract

The task of detecting regionalisms (expressions or words used in certain regions) has traditionally relied on the use of questionnaires and surveys, and has also heavily depended on the expertise and intuition of the surveyor. The irruption of Social Media and its microblogging services has produced an unprecedented wealth of content, mainly informal text generated by users, opening new opportunities for linguists to extend their studies of language variation. Previous work on automatic detection of regionalisms depended mostly on word frequencies. In this work, we present a novel metric based on Information Theory that incorporates user frequency. We tested this metric on a corpus of Argentinian Spanish tweets in two ways: via manual annotation of the relevance of the retrieved terms, and also as a feature selection method for geolocation of users. In either case, our metric outperformed other techniques based solely on word frequency, suggesting that measuring the number of users that produce a word is informative. This tool has helped lexicographers discover several unregistered words of Argentinian Spanish, as well as different meanings assigned to registered words.
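To make the intuition in the abstract concrete, the following is a minimal sketch of an entropy-based regional score. It is not the paper's actual formula, which is not given in this summary. The function names, the `log2(1 + n)` weighting, the toy data, and the assumption that all regions are equally represented are illustrative choices only. The sketch scores a word by how concentrated its usage is across regions, counting either distinct users or raw occurrences.

```python
import math
from collections import defaultdict


def entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution implied by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def regional_scores(observations, n_regions, by="users"):
    """Score each word by how regionally concentrated its usage is.

    observations: iterable of (region, user, word) tuples.
    by="users" counts distinct users per region; by="words" counts occurrences.
    Hypothetical scoring, assumed for illustration:
        log2(1 + total) * (log2(n_regions) - entropy over regions)
    """
    occurrences = defaultdict(lambda: defaultdict(int))  # word -> region -> count
    users = defaultdict(lambda: defaultdict(set))        # word -> region -> users
    for region, user, word in observations:
        occurrences[word][region] += 1
        users[word][region].add(user)

    max_entropy = math.log2(n_regions)  # entropy of a perfectly even spread
    scores = {}
    for word in occurrences:
        if by == "users":
            counts = [len(u) for u in users[word].values()]
        else:
            counts = list(occurrences[word].values())
        concentration = max_entropy - entropy_bits(counts)
        scores[word] = math.log2(1 + sum(counts)) * concentration
    return scores


# Toy data (invented): "che" is used once each by four users in region A,
# "spam" is posted 50 times by a single account in region A,
# "hola" is used by two users in each of regions A and B.
observations = (
    [("A", f"u{i}", "che") for i in range(4)]
    + [("A", "bot", "spam")] * 50
    + [("A", "u1", "hola"), ("A", "u2", "hola"),
       ("B", "u7", "hola"), ("B", "u8", "hola")]
)

user_scores = regional_scores(observations, n_regions=2, by="users")
word_scores = regional_scores(observations, n_regions=2, by="words")
```

In this toy setup the occurrence-based score ranks `spam` above `che` because a single prolific account inflates the count, while the user-based score ranks `che` higher. The evenly spread word `hola` scores zero under both variants. This mirrors, in a simplified way, why counting users can be more informative than counting words alone.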

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.04492/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/1907.04492/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/1907.04492/full.md

---
Source: https://tomesphere.com/paper/1907.04492