ML-driven detection and reduction of ballast information in multi-modal datasets
Yaroslav Solovko

TL;DR
This paper presents a comprehensive framework for detecting and reducing redundant ballast information in multi-modal datasets, leading to more efficient machine learning pipelines with minimal performance loss.
Contribution
It introduces a novel, cross-modal ballast detection and reduction framework utilizing diverse analytical techniques and a new Ballast Score for effective feature pruning.
Findings
In sparse or semi-structured data, often more than 70% of features can be pruned with minimal impact on accuracy.
Training time and memory usage are substantially reduced.
Framework identifies different ballast types, guiding efficient data preprocessing.
Abstract
Modern datasets often contain ballast: redundant or low-utility information that increases dimensionality, storage requirements, and computational cost without contributing meaningful analytical value. This study introduces a generalized, multimodal framework for ballast detection and reduction across structured, semi-structured, unstructured, and sparse data types. Entropy, mutual information, Lasso, SHAP, PCA, topic modelling, and embedding analysis are applied to diverse datasets to identify and eliminate ballast features. A novel Ballast Score is proposed to integrate these signals into a unified, cross-modal pruning strategy. Experimental results demonstrate that significant portions of the feature space, often exceeding 70% in sparse or semi-structured data, can be pruned with minimal loss in, or even improved, classification performance, along with substantial reductions in…
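The abstract does not give the formula for the Ballast Score, so the following is only a minimal sketch of how two of the listed signals (entropy and mutual information) could be combined into a single pruning score. The function names, the weights, the normalisation, and the threshold are all assumptions made for illustration and are not the paper's definition; the toy columns are invented to exercise the code, not taken from the paper's experiments.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences, in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def ballast_scores(columns, labels, w_entropy=0.25, w_mi=0.75):
    """Hypothetical score in [0, 1]; higher means more likely ballast.

    The weights are illustrative assumptions, not values from the paper.
    """
    h_y = entropy(labels)
    scores = {}
    for name, col in columns.items():
        k = len(set(col))
        # Normalise entropy by its maximum, log2(k); constant columns get 0.
        h_norm = entropy(col) / math.log2(k) if k > 1 else 0.0
        # Normalise mutual information by the label entropy, its upper bound.
        mi_norm = mutual_information(col, labels) / h_y if h_y > 0 else 0.0
        utility = w_entropy * h_norm + w_mi * mi_norm
        scores[name] = 1.0 - utility
    return scores


def prune(columns, scores, threshold=0.5):
    """Keep only the features whose score falls below the (assumed) threshold."""
    return {name: col for name, col in columns.items() if scores[name] < threshold}


# Toy data, invented for illustration only.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
columns = {
    "constant": [1] * 8,                 # no variation at all
    "noise": [0, 1, 0, 1, 0, 0, 1, 1],   # varies, but independent of the labels
    "signal": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to the labels
}

scores = ballast_scores(columns, labels)
kept = prune(columns, scores)
print(scores)
print(sorted(kept))
```

On this toy input the constant and label-independent columns score high and are dropped, while the label-aligned column is kept. A fuller version would add the other signals the abstract names (Lasso coefficients, SHAP values, PCA loadings, topic and embedding analysis) as further weighted terms.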
Taxonomy
Topics: Advanced Neural Network Applications · Infrastructure Maintenance and Monitoring · Underwater Vehicles and Communication Systems
