Challenges of Heterogeneity in Big Data: A Comparative Study of Classification in Large-Scale Structured and Unstructured Domains
González Trigueros Jesús Eduardo, Alonso Sánchez Alejandro, Muñoz Rivera Emilio, Peñarán Prieto Mariana Jaqueline, Mendoza González Camila Natalia

TL;DR
This paper compares classification strategies across structured and unstructured big data domains, revealing how data heterogeneity affects model performance and guiding algorithm selection based on data type and infrastructure.
Contribution
It introduces a unified framework for selecting classification algorithms tailored to data heterogeneity and infrastructure constraints in big data environments.
Findings
Optimized linear models outperform deep architectures in high-dimensional spaces.
Distributed fine-tuning constraints cause overfitting in complex models for text data.
Feature engineering with Transformer embeddings improves simple model generalization.
Abstract
This study analyzes the impact of heterogeneity ("Variety") in Big Data by comparing classification strategies across structured (Epsilon) and unstructured (Rest-Mex, IMDB) domains. A dual methodology was implemented: evolutionary and Bayesian hyperparameter optimization (Genetic Algorithms, Optuna) in Python for numerical data, and distributed processing in Apache Spark for massive textual corpora. The results reveal a "complexity paradox": in high-dimensional spaces, optimized linear models (SVM, Logistic Regression) outperformed deep architectures and Gradient Boosting. Conversely, in text-based domains, the constraints of distributed fine-tuning led to overfitting in complex models, whereas robust feature engineering -- specifically Transformer-based embeddings (RoBERTa) and Bayesian Target Encoding -- enabled simpler models to generalize effectively. This work provides a unified…
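The abstract names Bayesian Target Encoding but does not give its formulation. As a rough illustration only, the sketch below uses the common smoothed-mean form, in which each category's target mean is shrunk toward the global mean by a prior weight. The function names and the `prior_weight` parameter are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, prior_weight=10.0):
    """Learn a smoothed target mean per category.

    Assumption: encoding = (sum_y + prior_weight * global_mean)
                           / (count + prior_weight)
    This is a generic smoothed-mean estimate, not necessarily the
    exact variant used by the authors.
    """
    if len(categories) != len(targets) or len(targets) == 0:
        raise ValueError("categories and targets must be non-empty and aligned")

    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    # Rare categories are pulled strongly toward the global mean;
    # frequent categories stay close to their own observed mean.
    encoding = {
        category: (sums[category] + prior_weight * global_mean)
        / (counts[category] + prior_weight)
        for category in counts
    }
    return encoding, global_mean


def apply_target_encoding(categories, encoding, global_mean):
    """Map categories to encoded values; unseen categories get the global mean."""
    return [encoding.get(category, global_mean) for category in categories]
```

In practice the encoding would be fitted on training folds only and then applied to held-out data, since fitting on the rows being encoded leaks the target into the features.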
Taxonomy
Topics: Machine Learning and Data Classification · Text and Document Classification Technologies · Big Data and Digital Economy
