Creating a level playing field for all symbols in a discretization
Matthew Butler, Dimitar Kazakov

TL;DR
This paper investigates how the Piecewise Aggregate Approximation (PAA) step in the SAX discretization method alters the distribution of the data, which undermines the equal-probability partitions derived from the standard normal curve.
Contribution
It shows that PAA changes the distribution of the data, to a degree that depends on how auto-correlated the series is, which challenges the SAX assumption that standard normal partitions yield equally probable symbols.
Findings
PAA shrinks the standard deviation of the data.
Auto-correlated data is less affected by this shrinkage.
Partitions based on the standard normal curve may no longer give equal symbol probabilities after PAA.
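The findings above can be illustrated with a small simulation. The sketch below is not the authors' code: every function name is hypothetical, and the segment length (4), alphabet size (4), series length, and AR(1) coefficient (0.9) are illustrative assumptions. It z-normalizes a white-noise series and an auto-correlated series, applies PAA, and then compares the resulting standard deviations and the symbol frequencies obtained with equal-probability cut points on the standard normal curve.

```python
import random
from statistics import NormalDist, fmean, pstdev


def znormalize(xs):
    """Rescale a series to zero mean and unit standard deviation."""
    mu, sigma = fmean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def paa(xs, w):
    """Mean of each consecutive, non-overlapping block of w points."""
    n = len(xs) - len(xs) % w  # drop an incomplete trailing block
    return [fmean(xs[i:i + w]) for i in range(0, n, w)]


def breakpoints(alphabet_size):
    """Cut points that split the standard normal curve into equal areas."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def symbol_frequencies(xs, cuts):
    """Fraction of values that fall into each interval between cut points."""
    counts = [0] * (len(cuts) + 1)
    for x in xs:
        counts[sum(x > c for c in cuts)] += 1
    return [c / len(xs) for c in counts]


def ar1(n, phi, rng):
    """AR(1) series x_t = phi * x_{t-1} + noise (an assumed example process)."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


rng = random.Random(0)
n, w, alphabet_size = 40_000, 4, 4  # illustrative values, not from the paper

white = znormalize([rng.gauss(0.0, 1.0) for _ in range(n)])
correlated = znormalize(ar1(n, 0.9, rng))
cuts = breakpoints(alphabet_size)

sd_white = pstdev(paa(white, w))
sd_corr = pstdev(paa(correlated, w))
freq_white = symbol_frequencies(paa(white, w), cuts)
freq_corr = symbol_frequencies(paa(correlated, w), cuts)

print("std after PAA, white noise:", round(sd_white, 3))
print("std after PAA, AR(1):      ", round(sd_corr, 3))
print("symbol frequencies, white noise:", [round(f, 3) for f in freq_white])
print("symbol frequencies, AR(1):      ", [round(f, 3) for f in freq_corr])
```

For uncorrelated data, averaging w points divides the variance by w, so the standard deviation after PAA should be close to 1/sqrt(w) and the outer symbols become noticeably rarer than the inner ones. The auto-correlated series should keep a standard deviation much closer to 1, and its symbol frequencies should stay closer to uniform.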
Abstract
In time series analysis research there is a strong interest in discrete representations of real-valued data streams. One approach that emerged over a decade ago and is still considered state-of-the-art is the Symbolic Aggregate Approximation (SAX) algorithm. This discretization algorithm was the first symbolic approach that mapped a real-valued time series to a symbolic representation that was guaranteed to lower-bound Euclidean distance. This paper is concerned with the SAX assumption that the data are highly Gaussian and with the use of the standard normal curve to choose the partitions that discretize the data. Generally, and certainly in its canonical form, though not necessarily, the SAX approach chooses partitions on the standard normal curve that would produce an equal probability for each symbol in a finite alphabet to occur. This procedure is generally valid as a time series is…
Taxonomy
Topics: Time Series Analysis and Forecasting · Music and Audio Processing · Advanced Text Analysis Techniques
