Learning Word Embeddings from the Portuguese Twitter Stream: A Study of some Practical Aspects
Pedro Saleiro, Luís Sarmento, Eduarda Mendes Rodrigues, Carlos Soares, Eugénio Oliveira

TL;DR
This study explores the practical challenges of creating large-scale Portuguese Twitter word embeddings, focusing on data volume, vocabulary size, and evaluation metrics, demonstrating scalable training and promising intrinsic evaluation results.
Contribution
It presents a scalable approach to train large vocabulary embeddings from Twitter data using limited hardware, and highlights issues with current evaluation metrics.
Findings
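Vocabulary size scaled from 2,048 words (500K training examples) to 32,768 words (10M training examples) on a single GPU, with a stable validation loss.
Training time per epoch grew approximately linearly as the setup was scaled up.
Intrinsic evaluation shows promising results at a vocabulary size of 32,768 words.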
Abstract
This paper describes a preliminary study for producing and distributing a large-scale database of embeddings from the Portuguese Twitter stream. We start by experimenting with a relatively small sample and focusing on three challenges: volume of training data, vocabulary size and intrinsic evaluation metrics. Using a single GPU, we were able to scale up vocabulary size from 2048 words embedded and 500K training examples to 32768 words over 10M training examples while keeping a stable validation loss and approximately linear trend on training time per epoch. We also observed that using less than 50% of the available training examples for each vocabulary size might result in overfitting. Results on intrinsic evaluation show promising performance for a vocabulary size of 32768 words. Nevertheless, intrinsic evaluation metrics suffer from over-sensitivity to their corresponding cosine…
