High-Dimensional Data Processing: Benchmarking Machine Learning and Deep Learning Architectures in Local and Distributed Environments

Julian Rodriguez; Piotr Lopez; Emiliano Lerma; Rafael Medrano; Jacobo Hernandez

arXiv:2512.10312·cs.DC·December 12, 2025

Open Access

TL;DR

This paper benchmarks machine learning and deep learning architectures for high-dimensional data processing in local and distributed environments, with emphasis on practical workflows and their technical implementation.

Contribution

It introduces a detailed methodology for benchmarking ML/DL architectures on high-dimensional data using distributed computing with Apache Spark.

Findings

01 Effective workflows for high-dimensional data analysis

02 Performance insights of ML/DL architectures in distributed settings

03 Implementation guidelines for Spark-based data processing

Abstract

This document reports the sequence of practices and methodologies implemented during the Big Data course. It details the workflow beginning with the processing of the Epsilon dataset through group and individual strategies, followed by text analysis and classification with RestMex and movie feature analysis with IMDb. Finally, it describes the technical implementation of a distributed computing cluster with Apache Spark on Linux using Scala.

Taxonomy

Topics: Cloud Computing and Resource Management · Big Data and Digital Economy · Scientific Computing and Data Management