Distance Functions and Normalization Under Stream Scenarios
Eduardo V. L. Barboza, Paulo R. Lisboa de Almeida, Alceu de Souza Britto Jr, Rafael M. O. Cruz

TL;DR
This paper evaluates the effectiveness of different distance functions and normalization strategies in data stream classification, highlighting that using original data with Canberra distance often yields good results.
Contribution
It compares eight distance functions under various normalization scenarios in data streams, emphasizing the limitations of full-stream normalization protocols.
Findings
Using original data without normalization can be effective.
Canberra distance performs well without prior data normalization.
Full-stream normalization protocols may lead to biased results.
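The summary does not spell out the Canberra distance, so the following is a minimal Python sketch using its standard definition, d(x, y) = sum_i |x_i - y_i| / (|x_i| + |y_i|), with the common convention that a term whose denominator is zero contributes zero. Each term lies in [0, 1] regardless of the feature's scale, which may help explain why the distance copes with unnormalized data. That interpretation, the function names, and the sample points are illustrative assumptions, not material taken from the paper.

```python
def canberra(x, y):
    """Canberra distance: sum of |x_i - y_i| / (|x_i| + |y_i|) over features."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # convention: a 0/0 term contributes nothing
            total += abs(a - b) / denom
    return total


def euclidean(x, y):
    """Plain Euclidean distance, shown only for contrast."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


# Hypothetical points with two features on very different scales.
p = [1.0, 1000.0]
q = [2.0, 1000.0]   # differs from p only in the small-scale feature
r = [1.0, 1100.0]   # differs from p only in the large-scale feature

# Euclidean distance is dominated by the large-scale feature,
# while Canberra weighs each difference relative to the feature's magnitude.
print(euclidean(p, q), euclidean(p, r))
print(canberra(p, q), canberra(p, r))
```

In this toy case the Euclidean distance ranks r as far more distant from p than q, whereas the Canberra distance ranks q as the more distant point because a change from 1 to 2 is relatively larger than a change from 1000 to 1100.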
Abstract
Data normalization is an essential task when modeling a classification system. When dealing with data streams, data normalization becomes especially challenging since we may not know in advance the properties of the features, such as their minimum/maximum values, and these properties may change over time. We compare the accuracies obtained with eight well-known distance functions in data streams under three scenarios: without normalization, normalized using the statistics of the first batch of data received, and normalized using the statistics of the previous batch received. We argue that experimental protocols for streams that consider the full stream as normalized are unrealistic and can lead to biased and poor results. Our results indicate that using the original data stream without applying normalization, together with the Canberra distance, can be a good combination when no information about the data stream is known beforehand.
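The three scenarios in the abstract can be sketched as follows. This is a rough illustration only: it assumes min-max scaling (the abstract mentions minimum/maximum values but does not name the scaler), represents the stream as a list of batches of feature rows, and, in the previous-batch scenario, scales the very first batch with its own statistics since no earlier batch exists. The function names and strategy labels are hypothetical.

```python
def fit_min_max(batch):
    """Return per-feature (mins, maxs) computed from one batch of rows."""
    columns = list(zip(*batch))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(batch, stats):
    """Scale a batch with previously fitted statistics (no clipping)."""
    mins, maxs = stats
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in batch
    ]


def normalize_stream(batches, strategy):
    """Yield batches under 'none', 'first_batch', or 'previous_batch'."""
    if strategy not in ("none", "first_batch", "previous_batch"):
        raise ValueError(f"unknown strategy: {strategy}")
    stats = None
    for batch in batches:
        if strategy == "none":
            yield batch
            continue
        if stats is None:
            # Assumption: the first batch is scaled with its own statistics.
            stats = fit_min_max(batch)
        yield apply_min_max(batch, stats)
        if strategy == "previous_batch":
            # The next batch will be scaled with this batch's statistics.
            stats = fit_min_max(batch)


# Hypothetical drifting stream: feature ranges shift upward over time.
stream = [
    [[0.0, 10.0], [10.0, 20.0]],
    [[20.0, 30.0], [40.0, 50.0]],
    [[40.0, 50.0]],
]
first = list(normalize_stream(stream, "first_batch"))
previous = list(normalize_stream(stream, "previous_batch"))
print(first[2], previous[2])
```

Neither scenario looks at future data, unlike a protocol that normalizes the full stream up front. In the drifting toy stream, values scaled with first-batch statistics move well outside [0, 1], while previous-batch statistics track the drift with a one-batch lag.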
