TL;DR
This paper introduces optimized overlap-add windows with maximum energy concentration for speech and audio processing, reducing side-lobe magnitudes through constrained optimization while preserving perfect reconstruction.
Contribution
It proposes a novel method to optimize overlap-add windows by incorporating the overlap-add structure as a constraint, improving performance over traditional windows.
Findings
Notable reduction in side-lobe magnitude.
Improved energy concentration compared with the half-sine and KBD windows.
Effective optimization of low-overlap windows.
Overlap-Add Windows with Maximum Energy Concentration for Speech and Audio Processing
Abstract
Processing of speech and audio signals with time-frequency representations requires windowing methods which allow perfect reconstruction of the original signal and where processing artifacts have a predictable behavior. The most common approach for this purpose is overlap-add windowing, where signal segments are windowed before and after processing. Commonly used windows include the half-sine and a Kaiser-Bessel derived window. The latter is an approximation of the discrete prolate spheroidal sequence, and thus a maximum energy concentration window, adapted for overlap-add. We demonstrate that performance can be improved by including the overlap-add structure as a constraint in the optimization of the maximum energy concentration criterion. The same approach can be used to find further special cases such as optimal low-overlap windows. Our experiments demonstrate that the proposed windows provide notable improvements in terms of reduction in side-lobe magnitude.
**Index Terms—** time-frequency processing, windowing, discrete prolate spheroidal sequences
1 Introduction
Speech and audio signals are slowly time-varying in character, such that it is beneficial to analyze and process them in short segments. When the segment length is chosen appropriately, we can treat the signal as a stationary process within the segment such that statistical modeling becomes efficient. Many applications then use time-frequency transforms on the segments such as the short-time Fourier transform or the modified discrete cosine transform, for the benefit of statistical and perceptual efficiency [1, 2, 3, 4].
Segmenting a signal is a windowing problem, where the segment is extracted by multiplying the signal with a windowing function which is non-zero only in a limited range. For analysis applications, the design of such windowing functions has a long history in signal processing, and its theory is presented in most introductory signal processing textbooks, e.g. [5]. The principal objective of windowing in analysis applications is to minimize the detrimental effect of windowing on the signal statistics. In processing applications, however, we also need to consider the effect of windowing on the reconstruction process.
A widely used approach in time-frequency processing of signals is known as overlap-add, where the input signal is windowed into overlapping segments, and after processing, the segments are windowed a second time before adding them together [6, 7, 4] (see Fig. 1). By a careful choice of windowing functions, we can ensure that, in the absence of modifications to the windowed signal, the original signal can be reconstructed from the windowed segments. This is known as the perfect reconstruction property.
Windowing is often discussed in combination with time-frequency transforms, in which case the combination is known as a filterbank [8]. A particular class of filterbanks are those which, in addition to perfect reconstruction, also provide critical sampling. The most commonly used critically sampled filterbank in audio processing is the modified discrete cosine transform [9, 10, 11], which can also be applied in a bit-exact manner [12]. Typically, such applications use the half-sine or a Kaiser-Bessel derived (KBD) window, which are among the few windows applicable to overlap-add. A radically different approach is commonly used in speech coding with code-excited linear prediction (CELP), where temporal correlation is explicitly modeled by a linear predictor, such that the predictor residual can be windowed without overlap [13, 3].
The performance of windows suitable for overlap-add has, however, not received the same rigorous attention as classical windowing methods. This paper presents methods for representing the symmetries required by overlap-add as constraints, such that the window performance can be optimized. Specifically, we use the maximum energy concentration criterion [14], familiar from Slepian or discrete prolate spheroidal sequence (DPSS) windows, to obtain optimal windows for overlap-add.
2 Overlap-Add Windowing
The objectives when applying windowing in processing applications are two-fold:
- In the absence of any modifications, we require that the original signal can be reconstructed perfectly.
- When the windowed signal is modified, the expected energy of the modification (or error) in the output signal should be uniform over time.
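Let $x_n$ be our input signal, which we want to segment into overlapping windows. With a hop of $N/2$ samples, window $k$ of the signal is then
$$x_{k,n} = w_n \, x_{n+kN/2}, \qquad (1)$$
where $w_n$ is the windowing function of length $N$, defined such that
$$w_n = 0 \quad \text{for } n < 0 \text{ and for } n \geq N. \qquad (2)$$
We can then apply some processing to the windows such that the modified signal is $\hat x_{k,n}$.
To reconstruct the signal, we apply windowing again by multiplying with the windowing function and add the windows together, such that the modified output signal is
$$y_n = \sum_k w_{n-kN/2} \, \hat x_{k,\,n-kN/2}. \qquad (3)$$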
It is important to observe that the window is applied twice, once on the input signal and a second time after processing on the modified output window. Only after applying the window twice can we add the segments together to obtain the resynthesised signal.
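The analysis and synthesis steps of Eqs. 1 to 3 can be sketched in NumPy as below. This is an illustration under stated assumptions, not the paper's implementation: it assumes a hop of $N/2$ samples, uses the half-sine window as an example of a window satisfying the Princen-Bradley condition, and the function names `half_sine`, `ola_analysis` and `ola_synthesis` are hypothetical.

```python
import numpy as np


def half_sine(N):
    """Half-sine window of even length N; satisfies Princen-Bradley for a hop of N/2."""
    n = np.arange(N)
    return np.sin(np.pi * (n + 0.5) / N)


def ola_analysis(x, w):
    """Analysis windowing (Eq. 1): frame k is w[n] * x[n + k*N/2], 0 <= n < N."""
    N = len(w)
    hop = N // 2
    num_frames = (len(x) - N) // hop + 1
    return np.stack([w * x[k * hop : k * hop + N] for k in range(num_frames)])


def ola_synthesis(frames, w, length):
    """Synthesis (Eq. 3): window each processed frame a second time and overlap-add."""
    N = len(w)
    hop = N // 2
    y = np.zeros(length)
    for k, frame in enumerate(frames):
        y[k * hop : k * hop + N] += w * frame
    return y


# Usage: no processing between analysis and synthesis.
rng = np.random.default_rng(0)
N = 16
x = rng.standard_normal(20 * N)
w = half_sine(N)
frames = ola_analysis(x, w)
y = ola_synthesis(frames, w, len(x))
# Largest reconstruction error over the samples covered by two windows.
print(np.max(np.abs(y[N // 2 : -N // 2] - x[N // 2 : -N // 2])))
```

With no processing between the two steps, every sample covered by two windows is reconstructed exactly; the first and last $N/2$ samples are covered by a single window only and therefore remain attenuated.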
It is well known, and readily verified, that both perfect reconstruction and uniform error energy are ensured when the windowing function of length $N$, applied with a hop of $N/2$, satisfies the Princen-Bradley condition [3, 2]
$$w_n^2 + w_{n+N/2}^2 = 1, \qquad 0 \le n < N/2. \qquad (4)$$
Figure 2 illustrates a typical windowing function which satisfies the Princen-Bradley condition, and Figure 1 illustrates the effect of overlap-add windowing on a speech signal.
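Why the Princen-Bradley condition guarantees both objectives can be seen in one substitution. Assume a hop of $N/2$, analysis $x_{k,n} = w_n x_{n+kN/2}$ and synthesis $y_n = \sum_k w_{n-kN/2}\,\hat x_{k,n-kN/2}$, and model the processing as additive errors, $\hat x_{k,n} = x_{k,n} + e_{k,n}$, where the $e_{k,n}$ are zero-mean, mutually uncorrelated and of variance $\sigma^2$ (a modelling assumption made only for this derivation). Then

$$
y_n = x_n \sum_k w_{n-kN/2}^2 + \sum_k w_{n-kN/2}\, e_{k,\,n-kN/2}.
$$

Exactly two windows cover each sample $n$; with $m = n \bmod N/2$ their squared values are $w_m^2$ and $w_{m+N/2}^2$, so that by Eq. 4

$$
\sum_k w_{n-kN/2}^2 = w_m^2 + w_{m+N/2}^2 = 1
\quad\Longrightarrow\quad
y_n = x_n \ \text{ when } e_{k,n} = 0,
\qquad
\mathbb{E}\!\left[(y_n - x_n)^2\right] = \sigma^2 \sum_k w_{n-kN/2}^2 = \sigma^2 .
$$

The first identity is perfect reconstruction, and the second states that the error energy is the same for every $n$.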
3 Constrained Maximization of Energy Concentration
Windowing in the time domain corresponds to convolution in the frequency domain. To minimize frequency-domain distortion, we therefore require that the energy of the windowing function is maximally concentrated in the frequency domain. The concentration of energy can be evaluated by the ratio of the energy in the pass-band to the total energy
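$$\gamma = \frac{\int_{-\theta}^{\theta} |W(e^{i\omega})|^2 \, d\omega}{\int_{-\pi}^{\pi} |W(e^{i\omega})|^2 \, d\omega}, \qquad (5)$$
where $W(e^{i\omega})$ is the spectrum of the windowing function and $\theta$ is the bandwidth of the pass-band $[-\theta, \theta]$. For discrete, finite-length windowing functions it can be shown that the above ratio is equivalent to
$$\gamma = \frac{w^T R w}{w^T w}, \qquad (6)$$
where $w = [w_0, \dots, w_{N-1}]^T$ is the windowing function in vector form and $R$ is a symmetric Toeplitz matrix with elements
$$R_{jk} = \frac{\sin\big(\theta (j-k)\big)}{\pi (j-k)} \quad (j \neq k), \qquad R_{jj} = \frac{\theta}{\pi}, \qquad (7)$$
where $\theta$ defines the width of the main lobe. Clearly, $\gamma$ is maximized by the eigenvector of $R$ corresponding to the largest eigenvalue, and we can equivalently define
$$\hat w = \arg\max_{w} \; w^T R w \quad \text{subject to} \quad w^T w = 1. \qquad (8)$$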
The eigenvectors of $R$ are known as discrete prolate spheroidal sequences (DPSS), and the corresponding windowing functions are known as DPSS or Slepian windows [14, 15, 16].
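The following NumPy sketch builds $R$ from Eq. 7, obtains the DPSS window as the dominant eigenvector (Eq. 8), and checks numerically that the Rayleigh quotient of Eq. 6 agrees with a direct evaluation of Eq. 5 on a dense FFT grid. The window length, the bandwidth $\theta = 3\pi/N$ and the helper names are illustrative assumptions, not values or code from the paper.

```python
import numpy as np


def concentration_matrix(N, theta):
    """Toeplitz matrix of Eq. 7: R[j, k] = sin(theta (j-k)) / (pi (j-k)), R[j, j] = theta / pi."""
    d = np.arange(N)
    lag = d[:, None] - d[None, :]
    # np.sinc(t) = sin(pi t) / (pi t), hence (theta/pi) * sinc(theta*lag/pi) = sin(theta*lag) / (pi*lag)
    return (theta / np.pi) * np.sinc(theta * lag / np.pi)


def energy_concentration(w, R):
    """Rayleigh quotient of Eq. 6."""
    return (w @ R @ w) / (w @ w)


def dpss_window(N, theta):
    """Unconstrained optimum of Eq. 8: eigenvector of R with the largest eigenvalue."""
    R = concentration_matrix(N, theta)
    eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalues in ascending order
    w = eigvecs[:, -1]
    return w * np.sign(w.sum()), eigvals[-1]  # flip the sign so that the window is positive


def fft_concentration(w, theta, nfft=1 << 18):
    """Direct numerical evaluation of Eq. 5 on a dense frequency grid."""
    power = np.abs(np.fft.fft(w, nfft)) ** 2
    omega = 2 * np.pi * np.fft.fftfreq(nfft)  # angular frequencies in [-pi, pi)
    return power[np.abs(omega) <= theta].sum() / power.sum()


N = 32
theta = 3 * np.pi / N  # illustrative choice, not a value from the paper
R = concentration_matrix(N, theta)
w_dpss, lam_max = dpss_window(N, theta)
w_sine = np.sin(np.pi * (np.arange(N) + 0.5) / N)
print(lam_max, energy_concentration(w_dpss, R), fft_concentration(w_dpss, theta))
print(energy_concentration(w_sine, R))  # a Rayleigh quotient, so never above lam_max
```

Because the DPSS window is the unconstrained maximizer, no other window of the same length, including the half-sine, can exceed its energy concentration for the same $\theta$.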
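The main objective of this paper is to design windowing functions which satisfy the symmetries required by overlap-add processing while simultaneously optimizing the above spectral characteristics. The Princen-Bradley conditions of Eq. 4 can then be written as
$$w^T D_k w = 1, \qquad 0 \le k < N/2, \qquad (9)$$
where $D_k$ is diagonal with diagonal entries $[D_k]_{nn} = 1$ for $n \in \{k,\, k+N/2\}$ and $[D_k]_{nn} = 0$ otherwise. In other words, $D_k$ has two non-zero entries on the diagonal, which pick out the $k$th and $(k+N/2)$th samples of the windowing vector $w$. Since their diagonal entries are non-negative, the matrices $D_k$ are positive semi-definite. Observe that the constraints of Eq. 9 are similar to the constraint in Eq. 8 but stricter: summing Eq. 9 over all $k$ gives $w^T w = N/2$, which is the norm constraint of Eq. 8 up to scaling, while each individual constraint further restricts the shape of the window. We can therefore define a new optimization problem, using the constraints of Eq. 9 and the objective function of Eq. 8, as
$$\hat w = \arg\max_{w} \; w^T R w \quad \text{subject to} \quad w^T D_k w = 1, \quad 0 \le k < N/2. \qquad (10)$$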
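This is a quadratically constrained quadratic programming (QCQP) problem. A QCQP is convex when a positive semi-definite quadratic form is minimized subject to inequality constraints whose matrices are also positive semi-definite; here, however, we maximize $w^T R w$ subject to quadratic equality constraints, so the problem is not convex in general. We can nevertheless use interior-point methods to find a solution numerically, keeping in mind that such methods guarantee only a locally optimal solution.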
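A minimal sketch of Eq. 10 in Python, assuming SciPy is available. The paper uses MATLAB's interior-point solver; this sketch instead uses SciPy's SLSQP method (sequential quadratic programming), supplies analytic gradients of the objective and of the constraints $w_k^2 + w_{k+N/2}^2 = 1$, and starts from the half-sine window, which satisfies these constraints exactly. Since the objective is maximized under quadratic equality constraints, the solver can only be expected to return a local optimum. The values of $N$ and $\theta$ and the helper names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize


def concentration_matrix(N, theta):
    """Toeplitz matrix of Eq. 7 (np.sinc(t) = sin(pi t) / (pi t))."""
    lag = np.arange(N)[:, None] - np.arange(N)[None, :]
    return (theta / np.pi) * np.sinc(theta * lag / np.pi)


def ola_dpss(N, theta):
    """Sketch of Eq. 10: maximize w^T R w subject to w^T D_k w = 1, 0 <= k < N/2."""
    R = concentration_matrix(N, theta)
    half = N // 2
    k = np.arange(half)

    def neg_objective(w):  # SciPy minimizes, so negate the objective
        return -(w @ R @ w)

    def neg_gradient(w):
        return -2.0 * (R @ w)

    def pb_residual(w):  # w_k^2 + w_{k+N/2}^2 - 1, i.e. Eq. 9
        return w[:half] ** 2 + w[half:] ** 2 - 1.0

    def pb_jacobian(w):  # each row touches only samples k and k + N/2
        J = np.zeros((half, N))
        J[k, k] = 2.0 * w[:half]
        J[k, k + half] = 2.0 * w[half:]
        return J

    # The half-sine window satisfies Eq. 9 exactly, so it is a feasible start.
    w0 = np.sin(np.pi * (np.arange(N) + 0.5) / N)
    result = minimize(
        neg_objective,
        w0,
        jac=neg_gradient,
        method="SLSQP",
        constraints=[{"type": "eq", "fun": pb_residual, "jac": pb_jacobian}],
        options={"maxiter": 1000, "ftol": 1e-10},
    )
    return result.x, w0, R


def energy_concentration(w, R):
    """Rayleigh quotient of Eq. 6."""
    return (w @ R @ w) / (w @ w)


N = 32
theta = 3 * np.pi / N  # illustrative bandwidth, not taken from the paper
w_ola, w_sine, R = ola_dpss(N, theta)
print(energy_concentration(w_sine, R), energy_concentration(w_ola, R))
```

Starting from a feasible window means that the result should be at least as concentrated as the half-sine; the largest eigenvalue of $R$ remains an upper bound, since the unconstrained DPSS window generally violates the Princen-Bradley condition.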
4 Low-overlap Windows
In some applications, it is desirable to limit the overlap length between windows [17]. The conventional approach to designing windows of length $N$ with overlap $M$ is to choose a windowing function of length $2M$ and extend it by inserting a vector of $N - 2M$ ones in the middle, such that the desired length is achieved (see Fig. 3). This heuristic method can now be amended using the optimization presented above.
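The conventional construction can be sketched as follows, for a window of length $N$ with overlap $M$ used with a hop of $N - M$ samples. The half-sine as base window, the sizes and the helper names are illustrative assumptions; the check overlap-adds the squared window, which equals one wherever reconstruction is perfect.

```python
import numpy as np


def low_overlap_half_sine(N, M):
    """Conventional low-overlap window of length N and overlap M (0 < M <= N/2):
    a half-sine of length 2M is split in the middle and N - 2M ones are inserted."""
    n = np.arange(2 * M)
    base = np.sin(np.pi * (n + 0.5) / (2 * M))
    return np.concatenate([base[:M], np.ones(N - 2 * M), base[M:]])


def ola_squared_gain(w, hop, num_frames=8):
    """Overlap-add w**2 with the given hop; values of 1 away from the first and
    last M samples indicate perfect reconstruction (cf. Eq. 4)."""
    N = len(w)
    gain = np.zeros(hop * (num_frames - 1) + N)
    for k in range(num_frames):
        gain[k * hop : k * hop + N] += w ** 2
    return gain


N, M = 32, 8  # illustrative sizes
w = low_overlap_half_sine(N, M)
gain = ola_squared_gain(w, hop=N - M)
print(np.max(np.abs(gain[M:-M] - 1.0)))  # deviation from 1 in the covered region
```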
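Specifically, with a hop of $N - M$ samples between consecutive windows, we can define new constraints as
$$w^T D'_k w = 1, \quad 0 \le k < M, \qquad w_n = 1, \quad M \le n < N - M, \qquad (11)$$
where $D'_k$ is diagonal with ones at positions $k$ and $k + N - M$ and zeros elsewhere.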
Substituting these quadratic and linear constraints into the optimization problem of Eq. 10 yields a low-overlap window which has maximal energy concentration.
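The low-overlap problem can be sketched by reusing the approach for Eq. 10, assuming a hop of $N - M$ samples. In this hypothetical implementation the linear constraints $w_n = 1$ for $M \le n < N - M$ are eliminated by fixing the middle samples, so only the $2M$ edge samples are optimized, subject to $w_k^2 + w_{k+N-M}^2 = 1$ for $0 \le k < M$. SciPy's SLSQP method stands in for the interior-point solver used in the paper, and the sizes and bandwidth are illustrative.

```python
import numpy as np
from scipy.optimize import minimize


def concentration_matrix(N, theta):
    """Toeplitz matrix of Eq. 7."""
    lag = np.arange(N)[:, None] - np.arange(N)[None, :]
    return (theta / np.pi) * np.sinc(theta * lag / np.pi)


def low_overlap_ola_dpss(N, M, theta):
    """Sketch of Eq. 10 with the constraints of Eq. 11 (hop N - M, overlap M)."""
    R = concentration_matrix(N, theta)
    free = np.r_[0:M, N - M:N]  # indices of the rising and falling edges
    k = np.arange(M)

    def assemble(z):
        w = np.ones(N)  # linear constraints: w_n = 1 for M <= n < N - M
        w[free] = z
        return w

    def neg_objective(z):
        w = assemble(z)
        return -(w @ R @ w)

    def neg_gradient(z):
        return -2.0 * (R @ assemble(z))[free]

    def pb_residual(z):  # w_k^2 + w_{k+N-M}^2 - 1 for 0 <= k < M
        return z[:M] ** 2 + z[M:] ** 2 - 1.0

    def pb_jacobian(z):
        J = np.zeros((M, 2 * M))
        J[k, k] = 2.0 * z[:M]
        J[k, k + M] = 2.0 * z[M:]
        return J

    # Feasible start: the edges of the conventional low-overlap half-sine window.
    z0 = np.sin(np.pi * (np.arange(2 * M) + 0.5) / (2 * M))
    result = minimize(
        neg_objective,
        z0,
        jac=neg_gradient,
        method="SLSQP",
        constraints=[{"type": "eq", "fun": pb_residual, "jac": pb_jacobian}],
        options={"maxiter": 1000, "ftol": 1e-10},
    )
    return assemble(result.x), assemble(z0), R


def energy_concentration(w, R):
    """Rayleigh quotient of Eq. 6."""
    return (w @ R @ w) / (w @ w)


N, M = 32, 8
theta = 3 * np.pi / N  # illustrative bandwidth
w_opt, w_conv, R = low_overlap_ola_dpss(N, M, theta)
print(energy_concentration(w_conv, R), energy_concentration(w_opt, R))
```

Eliminating the linear constraints keeps the problem small and guarantees that the flat middle section is exactly one rather than only approximately so.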
5 Evaluation
The most commonly used overlap-add windows include the half-sine and a Kaiser-Bessel derived (KBD) window. The half-sine window is defined as
$$w_n = \sin\left(\frac{\pi (n + 1/2)}{N}\right), \qquad 0 \le n < N. \qquad (12)$$
The KBD window is based on the Kaiser-Bessel window, defined as
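$$v_n = \frac{I_0\!\left(\pi\alpha \sqrt{1 - \left(\frac{4n}{N} - 1\right)^2}\right)}{I_0(\pi\alpha)}, \qquad 0 \le n \le N/2, \qquad (13)$$
where $I_0$ is the zeroth-order modified Bessel function of the first kind and $\alpha$ specifies the width of the main-lobe. The KBD window is then defined as
$$w_n = \begin{cases} \sqrt{\dfrac{\sum_{j=0}^{n} v_j}{\sum_{j=0}^{N/2} v_j}}, & 0 \le n < N/2, \\[2ex] w_{N-1-n}, & N/2 \le n < N. \end{cases} \qquad (14)$$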
In other words, the KBD window takes the cumulative sum of the Kaiser-Bessel window, normalizes it by the total sum and then takes the square root to satisfy the Princen-Bradley condition.
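The two reference windows can be generated with NumPy as sketched below, using `np.kaiser` for the Kaiser-Bessel window of Eq. 13 with $\beta = \pi\alpha$, where $\alpha$ is the main-lobe width parameter. The value $\alpha = 4$ and the helper names are illustrative assumptions; the final function measures how closely each window satisfies the Princen-Bradley condition of Eq. 4.

```python
import numpy as np


def half_sine(N):
    """Half-sine window of Eq. 12."""
    n = np.arange(N)
    return np.sin(np.pi * (n + 0.5) / N)


def kbd(N, alpha):
    """Kaiser-Bessel derived window of even length N (Eqs. 13 and 14)."""
    half = N // 2
    # np.kaiser(L, beta) gives the length-L Kaiser-Bessel window; with L = N/2 + 1
    # and beta = pi * alpha it matches v_n of Eq. 13 for 0 <= n <= N/2.
    v = np.kaiser(half + 1, np.pi * alpha)
    rising = np.sqrt(np.cumsum(v[:half]) / np.sum(v))  # cumulative sum, normalize, sqrt
    return np.concatenate([rising, rising[::-1]])       # w_n = w_{N-1-n}


def princen_bradley_error(w):
    """Largest deviation from w_n^2 + w_{n+N/2}^2 = 1 (Eq. 4)."""
    half = len(w) // 2
    return np.max(np.abs(w[:half] ** 2 + w[half:] ** 2 - 1.0))


N = 32
alpha = 4.0  # illustrative main-lobe parameter
print(princen_bradley_error(half_sine(N)), princen_bradley_error(kbd(N, alpha)))
```

Both errors are at the level of floating-point rounding, since the symmetry of the Kaiser-Bessel window makes the two cumulative sums in each Princen-Bradley pair add up to the total sum.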
We generated the proposed DPSS-based overlap-add windows (OLA-DPSS) using the interior-point algorithm of the Optimization Toolbox in MATLAB R2018a. Fig. 4 shows the obtained window shapes for different values of the pass-band bandwidth parameter. As an informal observation, we did not encounter any convergence problems, and running times were only a few seconds. Since windowing functions are usually determined off-line, we conclude that computational cost is not an issue in calculating OLA-DPSS windows.
Figure 5 illustrates the half-sine, KBD and proposed windows and their spectral responses. Note that we have here manually tuned the bandwidth parameters in Eqs. 13 and 7 such that the main-lobe widths match that of the half-sine window. This choice allows a fair comparison of the side-lobe magnitudes.
We observe that the KBD window is very similar in shape to the half-sine, and their spectral responses differ only from the second side-lobe onward. The proposed OLA-DPSS window, however, has slightly higher tails near the ends of the window. Moreover, the spectral response of the OLA-DPSS has an approximately … advantage for the first side-lobe. The energy concentration ratios following Eq. 6 for the half-sine, KBD and OLA-DPSS windows are 16.6559, 16.6582 and …, respectively (parameters as in Fig. 5). In other words, by using OLA-DPSS, we obtain … and … improvements in energy concentration in comparison to the half-sine and KBD windows, respectively.
Figure 6 illustrates low-delay versions of the half-sine, KBD and proposed windows and their spectral responses. Here, differences appear only from the second side-lobe onward, where the OLA-DPSS is about … better than the half-sine and … better than KBD. The corresponding energy concentration ratios are 19.6191, 19.6182 and …, indicating that again the OLA-DPSS is the best (by design), but the difference to the others is marginal.
6 Conclusions
Design of windowing functions has a long tradition in signal analysis. In speech and audio processing, however, we also require that the signal can be reconstructed. The conventional approach is overlap-add, where subsequent windows are overlapped such that their sum recovers the original signal. This places constraints on the window design which have not been adequately taken into account in previous studies.
Slepian windowing functions based on discrete prolate spheroidal sequences (DPSS) are optimal in terms of energy concentration, and we therefore propose to apply the same objective function, but with constraints that satisfy the symmetries required by overlap-add. The optimization problem is a quadratically constrained quadratic programming problem, whose solution has become feasible with modern optimization toolboxes. Since windowing functions are usually determined off-line, computational complexity is not an issue.
The presented evaluations confirm that the proposed overlap-add DPSS (OLA-DPSS) windows achieve the desired energy concentration and outperform the conventional windows in all comparisons presented, although the advantage is marginal for the low-delay windows. Since OLA-DPSS windows are optimized for energy concentration under exactly the constraints that overlap-add imposes, we recommend them as the preferred choice in speech and audio processing applications.
References
- [1] J. Benesty, M. Sondhi, and Y. Huang, Springer Handbook of Speech Processing, Springer, 2008.
- [2] M. Bosi and R. E. Goldberg, Introduction to Digital Audio Coding and Standards, Kluwer Academic Publishers, 2003.
- [3] T. Bäckström, Speech Coding with Code-Excited Linear Prediction, Springer, 2017.
- [4] J. Vilkamo and T. Bäckström, “Time-frequency processing: Methods and tools,” in Parametric Time-Frequency Domain Spatial Audio, V. Pulkki, S. Delikaris-Manias, and A. Politis, Eds., pp. 3–24, Wiley, 2017.
- [5] S. K. Mitra, Digital signal processing: a computer-based approach, McGraw-Hill, 1998.
- [6] F. J. Harris, “On the use of windows for harmonic analysis with the discrete Fourier transform,” Proc. IEEE, vol. 66, no. 1, pp. 51–83, 1978.
- [7] A. H. Nuttall, “Spectral analysis by means of overlapped fast Fourier transform processing of windowed data,” NUSC Tech. Rep. 4169, 1971.
- [8] B. Boashash, Time-frequency signal analysis and processing: a comprehensive reference, Academic Press, 2015.
