Towards Semantic Noise Cleansing of Categorical Data based on Semantic Infusion
Rishabh Gupta, Rajesh N Rao

TL;DR
This paper introduces a novel semantic infusion method to identify and remove semantic noise from categorical text data, enhancing text understanding for domain-specific analytics.
Contribution
It formalizes semantic noise, proposes a semantic infusion technique, and develops an unsupervised framework for noise filtering based on term context.
Findings
Semantic infusion associates meta-data with text effectively.
The framework reduces semantic noise while preserving narrative.
Evaluation shows improved text clarity in automobile forum data.
Abstract
Semantic noise significantly affects text analytics in domain-specific industries. It impedes text understanding, which is of prime importance in critical decision-making tasks. In this work, we formalize semantic noise as a sequence of terms that do not contribute to the narrative of the text. We look beyond the notion of standard, statistically derived stop words and consider the semantics of terms to exclude semantic noise. We present a novel Semantic Infusion technique to associate meta-data with the categorical corpus text and demonstrate its near-lossless nature. Based on this technique, we propose an unsupervised text-preprocessing framework to filter semantic noise using the context of the terms. Finally, we present evaluation results for the proposed framework on a web forum dataset from the automobile domain.
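The abstract does not spell out how infusion or filtering is implemented, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes that "infusing" means attaching each document's category label to its tokens in a reversible way, and it substitutes a simple entropy heuristic for the paper's context-based filter: a term that is spread evenly across categories is treated as noise. The separator `SEP`, the function names, the threshold, and the toy forum posts are all hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict

SEP = "::"  # hypothetical separator; assumed never to occur inside a category label


def infuse(tokens, category):
    """Attach the category label (the meta-data) to every token."""
    return [f"{tok}{SEP}{category}" for tok in tokens]


def defuse(infused_tokens):
    """Strip the label again; recovers the original tokens."""
    return [item.rsplit(SEP, 1)[0] for item in infused_tokens]


def category_spread(docs):
    """Normalized entropy (0..1) of each term's distribution over categories.

    docs is a list of (category, tokens) pairs. This is a stand-in heuristic,
    not the context-based method described in the paper.
    """
    counts = defaultdict(Counter)
    categories = set()
    for category, tokens in docs:
        categories.add(category)
        for item in infuse(tokens, category):
            term, cat = item.rsplit(SEP, 1)
            counts[term][cat] += 1
    if len(categories) < 2:
        # With a single category there is nothing to compare against.
        return {term: 0.0 for term in counts}
    max_entropy = math.log(len(categories))
    spread = {}
    for term, per_cat in counts.items():
        total = sum(per_cat.values())
        entropy = -sum((n / total) * math.log(n / total) for n in per_cat.values())
        spread[term] = entropy / max_entropy
    return spread


def filter_noise(docs, threshold=0.9):
    """Drop terms whose spread across categories meets or exceeds the threshold."""
    spread = category_spread(docs)
    noise = {term for term, value in spread.items() if value >= threshold}
    cleaned = [(cat, [t for t in tokens if t not in noise]) for cat, tokens in docs]
    return cleaned, noise


# Toy automobile-forum posts, invented for illustration only.
docs = [
    ("engine", ["the", "engine", "stalls", "thanks"]),
    ("brakes", ["the", "brakes", "squeal", "thanks"]),
]
cleaned, noise = filter_noise(docs)
```

On this toy input, terms that occur equally in both categories ("the", "thanks") are flagged, while category-specific terms survive. Raw counts favor larger categories, so a realistic version would at least normalize by category size.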
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques
