Big enterprise registration data imputation: Supporting spatiotemporal analysis of industries in China
Fa Li, Zhipeng Gui, Huayi Wu, Jianya Gong, Yuan Wang, Siyu Tian, Jiawen Zhang

TL;DR
This paper presents a scalable HPC-based framework using Apache Spark and machine learning to impute missing and ambiguous data in large enterprise registration datasets, enabling detailed spatiotemporal industry analysis in China.
Contribution
It introduces a novel big data imputation workflow combining external data, NLP, and machine learning within HPC environments for enterprise data.
Findings
The framework is feasible and efficient for large-scale data imputation.
Imputation improves data quality for accurate spatiotemporal analysis.
The approach is scalable and adaptable to other georeferenced text data.
Abstract
Big, fine-grained enterprise registration data that includes time and location information enables us to quantitatively analyze, visualize, and understand the patterns of industries at multiple scales across time and space. However, data quality issues such as incompleteness and ambiguity hinder such analysis and application. These issues become more challenging when the volume of data is immense and constantly growing. High Performance Computing (HPC) frameworks can tackle big data computational issues, but few studies have systematically investigated imputation methods for enterprise registration data in this type of computing environment. In this paper, we propose a big data imputation workflow, based on Apache Spark and a bare-metal computing cluster, to impute enterprise registration data. We integrated external data sources, employed Natural Language Processing (NLP), and…
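To make the idea of imputing registration records from external data and text concrete, here is a minimal Python sketch. It is not the authors' implementation: the field names (`district`, `address`, `industry`, `business_scope`), the lookup table, and the keyword rules are all hypothetical placeholders, and the keyword match merely stands in for whatever NLP and machine learning models the paper actually uses.

```python
from typing import Optional

# Hypothetical external lookup (assumption): street name -> district.
EXTERNAL_DISTRICTS = {
    "Example Road": "District A",
    "Sample Avenue": "District B",
}

# Hypothetical keyword rules (assumption) standing in for a trained classifier.
INDUSTRY_KEYWORDS = {
    "software": "Information technology",
    "restaurant": "Catering",
}


def impute_district(record: dict) -> Optional[str]:
    """Keep an existing district; otherwise look it up from the address text."""
    if record.get("district"):
        return record["district"]
    address = record.get("address") or ""
    for street, district in EXTERNAL_DISTRICTS.items():
        if street in address:
            return district
    return None  # still unresolved; leave missing rather than guess


def impute_industry(record: dict) -> Optional[str]:
    """Keep an existing industry label; otherwise infer it from free text."""
    if record.get("industry"):
        return record["industry"]
    text = (record.get("business_scope") or "").lower()
    for keyword, industry in INDUSTRY_KEYWORDS.items():
        if keyword in text:
            return industry
    return None


def impute_record(record: dict) -> dict:
    """Return a copy of the record with missing fields filled where possible."""
    filled = dict(record)
    filled["district"] = impute_district(record)
    filled["industry"] = impute_industry(record)
    return filled


example = {
    "name": "Acme Co.",
    "address": "12 Example Road",
    "district": None,
    "industry": None,
    "business_scope": "Software development and consulting",
}
result = impute_record(example)
```

Because `impute_record` works on one record at a time and has no shared state, a function of this shape could in principle be mapped over a distributed dataset in a framework such as Apache Spark; how the paper partitions and schedules the work is not described in the text above.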
