# Development of email classifier in Brazilian Portuguese using feature selection for automatic response

**Authors:** Rogerio Bonatti, Arthur Gola de Paula

arXiv: 1907.04905 · 2019-07-12

## TL;DR

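This paper develops an email classifier for Brazilian Portuguese business messages and compares text preprocessing techniques (lemmatization and part-of-speech filtering), reaching a maximum accuracy of 87.3% with a Support Vector Machine (SVM) trained on non-lemmatized verbs, nouns and adjectives.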

## Contribution

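The paper introduces a novel corpus of Brazilian Portuguese emails taken from a real application, reports baseline categorization experiments with Naive Bayes and SVM, and evaluates the impact of lemmatization and part-of-speech (POS) filtering on classifier performance.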

## Key findings

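- SVM with non-lemmatized POS filtering (verbs, nouns and adjectives) was the best approach, with 87.3% maximum accuracy.
- Straightforward lemmatization gave the lowest results in the group: 85.3% precision with SVM and 81.7% with Naive Bayes.
- Overall, lemmatization reduced precision and recall, while POS filtering improved classification results.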

## Abstract

Automatic email categorization is an important application of text classification. We study the automatic reply to business email messages in Brazilian Portuguese. We present a novel corpus containing messages from a real application, and baseline categorization experiments using Naive Bayes and Support Vector Machines. We then discuss the effect of lemmatization and the role of part-of-speech tagging filtering on precision and recall. Support Vector Machines classification coupled with nonlemmatized selection of verbs, nouns and adjectives was the best approach, with 87.3% maximum accuracy. Straightforward lemmatization in Portuguese led to the lowest classification results in the group, with 85.3% and 81.7% precision in SVM and Naive Bayes, respectively. Thus, while lemmatization reduced precision and recall, part-of-speech filtering improved overall results.
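To make the preprocessing idea in the abstract concrete, the following is a minimal sketch of POS filtering followed by a Naive Bayes baseline. It is not the authors' implementation: the tiny tag lexicon `TOY_POS`, the category names (`cancelamento`, `senha`), and the example messages are all hypothetical, and a real system would use a trained Portuguese tagger and the paper's corpus. The SVM variant is not shown.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy POS lexicon, for illustration only.
# A real pipeline would obtain tags from a trained Portuguese POS tagger.
TOY_POS = {
    "quero": "VERB", "cancelar": "VERB", "preciso": "VERB", "alterar": "VERB",
    "pedido": "NOUN", "senha": "NOUN", "conta": "NOUN", "boleto": "NOUN",
    "nova": "ADJ",
    "o": "DET", "a": "DET", "meu": "DET", "minha": "DET", "de": "ADP",
}
KEEP = {"VERB", "NOUN", "ADJ"}  # the classes the abstract reports keeping


def pos_filter(text):
    """Keep only verbs, nouns and adjectives; tokens are NOT lemmatized."""
    return [t for t in text.lower().split() if TOY_POS.get(t) in KEEP]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for counts in self.word_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for t in tokens:
                if t not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training messages and reply categories.
train = [
    ("quero cancelar o meu pedido", "cancelamento"),
    ("preciso cancelar a minha conta", "cancelamento"),
    ("quero alterar a minha senha", "senha"),
    ("preciso de senha nova", "senha"),
]
model = NaiveBayes().fit([pos_filter(text) for text, _ in train],
                         [label for _, label in train])

print(model.predict(pos_filter("quero cancelar o boleto")))
```

Because the filter drops determiners and prepositions before counting, the classifier only sees content words, which is the feature-selection effect the abstract credits with improving results.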

---
Source: https://tomesphere.com/paper/1907.04905