Gender Prediction Based on Vietnamese Names with Machine Learning Techniques
Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen

TL;DR
This paper introduces a new Vietnamese name dataset and compares multiple machine learning and deep learning models, achieving up to 96% F1-score in gender prediction, and provides a web API for practical use.
Contribution
It presents the first comprehensive Vietnamese name dataset for gender prediction and evaluates various models, including deep learning, for improved accuracy.
Findings
LSTM with fastText achieves 96% F1-score
The dataset contains over 26,000 annotated names
Six classical machine learning models and an LSTM are compared on the same dataset
Abstract
As biological gender is one of the attributes that characterize an individual, much work has been done on gender classification based on people's names. Many approaches have been proposed for English and Chinese; still, there have been few works on Vietnamese so far. We propose a new dataset for gender prediction based on Vietnamese names. This dataset comprises over 26,000 full names annotated with genders and is available on our website for research purposes. In addition, this paper describes six machine learning algorithms (Support Vector Machine, Multinomial Naive Bayes, Bernoulli Naive Bayes, Decision Tree, Random Forest and Logistic Regression) and a deep learning model (LSTM) with fastText word embedding for gender prediction on Vietnamese names. We also investigate the impact of each name component on detecting gender. As a result, the best…
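To make the idea of measuring the impact of each name component concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the first token of a full name is the family name, the last token is the given name, and everything in between is the middle name. `split_name`, `features`, and `TinyMultinomialNB` are hypothetical helpers written for this illustration, and the four training names are made-up toy examples, not entries from the paper's dataset. The classifier is a from-scratch multinomial Naive Bayes with Laplace smoothing, one of the six algorithm families named in the abstract.

```python
import math
from collections import Counter, defaultdict

ALL_PARTS = ("family", "middle", "given")


def split_name(full_name):
    """Split a full name into (family, middle, given).

    Assumption for this sketch: first token = family name,
    last token = given name, anything in between = middle name.
    """
    tokens = full_name.lower().split()
    if not tokens:
        return "", "", ""
    if len(tokens) == 1:
        return "", "", tokens[0]
    return tokens[0], " ".join(tokens[1:-1]), tokens[-1]


def features(full_name, use=ALL_PARTS):
    """Return position-tagged tokens for the selected name components."""
    family, middle, given = split_name(full_name)
    parts = {"family": family, "middle": middle, "given": given}
    feats = []
    for key in use:
        feats.extend(f"{key}={tok}" for tok in parts[key].split())
    return feats


class TinyMultinomialNB:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, names, labels, use=ALL_PARTS):
        self.use = use
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for name, label in zip(names, labels):
            for feat in features(name, use):
                self.token_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = (sum(self.token_counts[label].values())
                     + self.alpha * len(self.vocab))
            for feat in features(name, self.use):
                if feat in self.vocab:  # tokens never seen in training are skipped
                    num = self.token_counts[label][feat] + self.alpha
                    score += math.log(num / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy data for illustration; NOT taken from the paper's dataset.
TOY_NAMES = ["Nguyễn Thị Lan", "Trần Thị Hoa", "Lê Văn Hùng", "Phạm Văn Nam"]
TOY_LABELS = ["female", "female", "male", "male"]

full_model = TinyMultinomialNB().fit(TOY_NAMES, TOY_LABELS)
given_only = TinyMultinomialNB().fit(TOY_NAMES, TOY_LABELS, use=("given",))

print(full_model.predict("Hoàng Thị Mai"))  # decided by the middle name only
print(given_only.predict("Vũ Đức Lan"))     # decided by the given name only
```

Training the same model with different `use` tuples, then scoring each variant on held-out names, is one simple way to compare how much each component contributes. The paper's actual feature extraction and evaluation protocol may differ.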
Taxonomy
Methods: fastText · Tanh Activation · Sigmoid Activation · Long Short-Term Memory
