Addestramento con Dataset Sbilanciati (Training with Imbalanced Datasets)
Massimiliano Morrelli

TL;DR
This paper compares methods for balancing datasets of short to medium sentences using Apache Spark, aiming to improve sentence classification in distributed environments, with potential applications in big data textual analysis.
Contribution
It evaluates various dataset balancing techniques within Spark to enhance the training of sentence classification models on web-derived data.
Findings
Balanced datasets improve classification accuracy.
Distributed training with Spark is effective for large text data.
Methods tested show varying effectiveness depending on data imbalance.
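As a rough illustration of what "balancing" means here, the sketch below shows two common baseline techniques, random oversampling and random undersampling, on a toy list of labeled sentences. This summary does not name the specific methods the paper evaluates, so these two are an assumption chosen for illustration, not the author's procedure. The function names, labels, and sample sentences are hypothetical, and the code is plain Python rather than Spark so that it runs without a cluster.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(samples):
    """Group (text, label) pairs into a dict keyed by label."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    return by_label


def random_oversample(samples, seed=0):
    """Duplicate randomly chosen minority-class examples until every
    class has as many examples as the largest class."""
    rng = random.Random(seed)
    by_label = _group_by_label(samples)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing examples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


def random_undersample(samples, seed=0):
    """Keep a random subset of each class so that every class has as
    many examples as the smallest class."""
    rng = random.Random(seed)
    by_label = _group_by_label(samples)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Draw without replacement, discarding the surplus examples.
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 6 sentences of one class, 2 of another.
toy = [("hello there", "greeting")] * 6 + [("my order is late", "complaint")] * 2

over = random_oversample(toy)
under = random_undersample(toy)
print(Counter(label for _, label in over))
print(Counter(label for _, label in under))
```

Oversampling keeps all the original data but repeats minority examples, which can encourage overfitting; undersampling avoids repetition but discards majority examples. In a Spark setting, a comparable stratified undersampling step could plausibly be built on `DataFrame.sampleBy`, though whether the paper does so is not stated in this summary.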
Abstract
This document compares several methods for balancing a dataset before training a model. The training dataset consists of short and medium-length sentences, such as simple phrases or excerpts from conversations held on web channels. The models are trained using the structures provided by the Apache Spark framework. They may later support a solution that classifies sentences in a distributed environment, as described in "New frontier of textual classification: Big data and distributed calculation" by Massimiliano Morrelli et al.
Taxonomy
Topics: Data Mining Algorithms and Applications · Data Management and Algorithms · Advanced Clustering Algorithms Research
