SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing

Weidong Chen; Xiaofen Xing; Xiangmin Xu; Jianxin Pang; Lan Du

arXiv:2302.14638·eess.AS·March 1, 2023


TL;DR

SpeechFormer++ is a hierarchical framework that effectively models speech properties for paralinguistic tasks, outperforming standard Transformers with lower computational costs.

Contribution

It introduces a structure-based hierarchical framework that captures speech characteristics and balances fine- and coarse-grained features for improved paralinguistic speech processing.

Findings

01. Outperforms standard Transformer in speech emotion recognition, depression classification, and Alzheimer's detection.

02. Reduces computational cost significantly compared to Transformer.

03. Achieves state-of-the-art results on multiple paralinguistic speech tasks.

Abstract

Paralinguistic speech processing is important in addressing many issues, such as sentiment and neurocognitive disorder analyses. Recently, Transformer has achieved remarkable success in the natural language processing field and has demonstrated its adaptation to speech. However, previous works on Transformer in the speech field have not incorporated the properties of speech, leaving the full potential of Transformer unexplored. In this paper, we consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing. More concretely, following the component relationship in the speech signal, we design a unit encoder to model the intra- and inter-unit information (i.e., frames, phones, and words) efficiently. According to the hierarchical relationship, we utilize merging blocks to generate features at different…

Tables (14)

Table 1. TABLE I: Performance and computational efficiency of Transformer and SpeechFormer++ using HuBERT features on IEMOCAP. Gain indicates the relative improvement (+) or reduction (-)

Method	Params	FLOPs	WA	UA	WF1
Transformer	63.64M	23.12G	0.685	0.701	0.692
SpeechFormer++	66.79M	6.55G	0.705	0.715	0.707
Gain	+4.95%	-71.67%	+2.92%	+2.00%	+2.17%
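
The Gain row above can be reproduced from the two rows before it. The following is a minimal sketch, assuming the gain is the usual relative change, (proposed - baseline) / baseline, expressed in percent; the function name `relative_gain` is introduced here for illustration only.

```python
def relative_gain(baseline: float, proposed: float) -> float:
    """Relative change of `proposed` with respect to `baseline`, in percent."""
    return (proposed - baseline) / baseline * 100.0


# Values copied from Table I (Transformer vs. SpeechFormer++ on IEMOCAP).
params_gain = relative_gain(63.64, 66.79)  # parameters, in millions
flops_gain = relative_gain(23.12, 6.55)    # FLOPs, in G
wa_gain = relative_gain(0.685, 0.705)      # weighted accuracy (WA)

print(f"Params {params_gain:+.2f}%  FLOPs {flops_gain:+.2f}%  WA {wa_gain:+.2f}%")
```

A positive value is an increase over the Transformer baseline and a negative value is a reduction, which matches the sign convention stated in the table caption.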

Table 2. TABLE II: Comparison with state-of-the-art methods on IEMOCAP. All systems apply audio as input for a fair and direct comparison. h/c=hand-crafted, w2v2=wav2vec 2.0

Method	Features	Year	WA	UA	WF1
STC[14]	H/C	2021	0.613	0.604	0.617
† ISNet[15]	H/C	2022	0.704	0.650	-
LSTM-GIN[17]	H/C	2021	0.647	0.655	-
SUPERB[77]	w2v2	2021	0.656	-	-
SUPERB[77]	HuBERT	2021	0.676	-	-
CA-MSER[54]	H/C + w2v2	2022	0.698	0.711	-
SpeechFormer++	H/C	2022	0.645	0.658	0.649
SpeechFormer++	HuBERT	2022	0.705	0.715	0.707

Table 3. TABLE III: Performance and computational efficiency of Transformer and SpeechFormer++ using HuBERT features on MELD. Gain indicates the relative improvement (+) or reduction (-)

Method	Params	FLOPs	WA	UA	WF1
Transformer	63.64M	15.33G	0.485	0.257	0.454
SpeechFormer++	66.79M	4.51G	0.510	0.273	0.470
Gain	+4.95%	-70.58%	+5.16%	+6.23%	+3.52%

Table 4. TABLE IV: Comparison with state-of-the-art methods on MELD. All systems apply audio as input for a fair and direct comparison. h/c=hand-crafted, w2v2=wav2vec 2.0

Method	Features	Year	WA	UA	WF1
† ConGCN[78]	H/C	2019	-	-	0.422
MMFA-RNN[19]	H/C	2020	0.488	-	0.423
† MM-DFN[18]	H/C	2022	-	-	0.427
† CTNet[37]	H/C	2021	0.469	-	0.382
Sharma[57]	w2v2	2022	0.498	-	-
SpeechFormer++	H/C	2022	0.480	0.236	0.429
SpeechFormer++	HuBERT	2022	0.510	0.273	0.470

Table 5. TABLE V: Performance and computational efficiency of Transformer and SpeechFormer++ using HuBERT features on Pitt. Gain indicates the relative improvement (+) or reduction (-)

Method	Params	FLOPs	WA	UA	WF1
Transformer	63.64M	23.28G	0.789	0.790	0.780
SpeechFormer++	66.79M	6.58G	0.813	0.816	0.808
Gain	+4.95%	-71.74%	+3.04%	+3.29%	+3.59%

Table 6. TABLE VI: Comparison with state-of-the-art methods on Pitt. All systems apply audio as input for a fair and direct comparison. h/c=hand-crafted, w2v2=wav2vec 2.0

Method	Features	Year	WA	UA	WF1
GCNN[10]	H/C	2018	0.736	-	-
Makiuchi[11]	H/C	2021	0.731	0.731	0.732
Autoencoder[20]	H/C	2022	0.739	0.641	0.621
Pérez-Toro[79]	w2v2	2022	-	-	0.720
Monica[33]	HuBERT	2022	0.740	0.740	0.745
SpeechFormer++	H/C	2022	0.742	0.738	0.727
SpeechFormer++	HuBERT	2022	0.813	0.816	0.808

Table 7. TABLE VII: Performance and computational efficiency of Transformer and SpeechFormer++ using HuBERT features on DAIC-WOZ. Gain indicates the relative improvement (+) or reduction (-)

Method	Params	FLOPs	WA	UA	MF1
Transformer	63.64M	31.26G	0.686	0.661	0.658
SpeechFormer++	66.79M	8.53G	0.771	0.726	0.709
Gain	+4.95%	-72.71%	+12.39%	+9.83%	+7.75%

Table 8. TABLE VIII: Comparison with state-of-the-art methods on DAIC-WOZ. All systems apply audio as input for a fair and direct comparison. h/c=hand-crafted, w2v2=wav2vec 2.0

Method	Features	Year	WA	UA	MF1
FVTC-CNN[13]	H/C	2020	0.735	0.656	0.640
EmoAudioNet[12]	H/C	2021	0.732	0.649	0.653
Saidi[80]	H/C	2020	0.680	0.680	0.680
Solieman[81]	H/C	2021	0.660	0.615	0.610
SIMSIAM-S[82]	HuBERT	2022	0.703	-	-
TOAT[56]	w2v2	2022	0.717	0.429	0.480
SpeechFormer++	H/C	2022	0.743	0.754	0.733
SpeechFormer++	HuBERT	2022	0.771	0.726	0.709

Table 9. TABLE IX: Performance of different approaches using HuBERT features. F1 stands for MF1 on DAIC-WOZ, and WF1 on other datasets.

Dataset	Method	WA	UA	F1
IEMOCAP	SUPERB[77]	0.676	-	-
	† STC[14]	0.657	0.659	0.643
	† LSTM-GIN[17]	0.661	0.669	0.662
	SpeechFormer++	0.705	0.715	0.707
MELD	‡ MM-DFN[18]	0.477	0.254	0.458
	† MMFA-RNN[19]	0.500	0.243	0.446
	SpeechFormer++	0.510	0.273	0.470
Pitt	Monica[33]	0.740	0.740	0.745
	† Makiuchi[11]	0.794	0.791	0.774
	SpeechFormer++	0.813	0.816	0.808
DAIC-WOZ	SIMSIAM-S[82]	0.703	-	-
	† FVTC-CNN[13]	0.714	0.703	0.694
	‡ EmoAudioNet[12]	0.686	0.701	0.676
	SpeechFormer++	0.771	0.726	0.709

Table 10. TABLE X: Ablation study on unit encoder (UE). Gain indicates the relative improvement (+) or reduction (-)

UE	IEMOCAP WA	IEMOCAP UA	MELD WF1	Pitt WA	Pitt UA	DAIC-WOZ MF1
w/o	0.686	0.694	0.457	0.796	0.803	0.680
w/	0.705	0.715	0.470	0.813	0.816	0.709
Gain	+2.77%	+3.03%	+2.84%	+2.14%	+1.62%	+4.26%
FLOPs	IEMOCAP	MELD	Pitt	DAIC-WOZ
w/o	6.77G	4.55G	6.81G	9.09G
w/	6.55G	4.51G	6.58G	8.53G
Gain	-3.25%	-0.88%	-3.38%	-6.16%

Table 11. TABLE XI: Ablation study on word encoder (WE). Gain indicates the relative improvement (+) or reduction (-)

WE	IEMOCAP WA	IEMOCAP UA	MELD WF1	Pitt WA	Pitt UA	DAIC-WOZ MF1
w/o	0.701	0.709	0.464	0.806	0.805	0.679
w/	0.705	0.715	0.470	0.813	0.816	0.709
Gain	+0.57%	+0.85%	+1.29%	+0.87%	+1.37%	+4.42%
FLOPs	IEMOCAP	MELD	Pitt	DAIC-WOZ
w/o	6.08G	4.18G	6.12G	7.93G
w/	6.55G	4.51G	6.58G	8.53G
Gain	+7.73%	+7.89%	+7.52%	+7.57%

Table 12. TABLE XII: Ablation study on merging Block (MB). Gain indicates the relative improvement (+) or reduction (-)

MB	IEMOCAP WA	IEMOCAP UA	MELD WF1	Pitt WA	Pitt UA	DAIC-WOZ MF1
w/o	0.696	0.708	0.464	0.810	0.813	0.672
w/	0.705	0.715	0.470	0.813	0.816	0.709
Gain	+1.29%	+0.99%	+1.29%	+0.37%	+0.37%	+5.51%
FLOPs	IEMOCAP	MELD	Pitt	DAIC-WOZ
w/o	22.17G	15.05G	22.31G	29.32G
w/	6.55G	4.51G	6.58G	8.53G
Gain	-70.46%	-70.03%	-70.51%	-70.91%

Table 13. TABLE XIII: Performances of finetuning the pretrained model with a simple MLP and learning further deep representation with SpeechFormer++. MLP = Multilayer Perceptron

Method	IEMOCAP WA	IEMOCAP UA	MELD WF1	Pitt WA	Pitt UA	DAIC-WOZ MF1
† MLP (FT)	0.677	0.689	0.455	0.793	0.789	0.676
♮ Transformer	0.680	0.699	0.457	0.798	0.793	0.665
‡ SpeechFormer++	0.695	0.702	0.468	0.805	0.805	0.746
SpeechFormer++	0.705	0.715	0.470	0.813	0.816	0.709

Table 14. TABLE XIV: Performances of adopting attention mechanisms from computer vision on four corpora. Window-based Swin and cluster-based BOAT algorithms are considered

Method	IEMOCAP WA	IEMOCAP UA	MELD WF1	Pitt WA	Pitt UA	DAIC-WOZ MF1
Swin-T[28]	0.623	0.631	0.422	0.741	0.741	0.669
BOAT-T[29]	0.627	0.627	0.419	0.755	0.747	0.683
Ours	0.705	0.715	0.470	0.813	0.816	0.709
FLOPs	IEMOCAP	MELD	Pitt	DAIC-WOZ
Swin-T[28]	11.50G	11.48G	11.50G	11.50G
BOAT-T[29]	15.01G	14.99G	15.01G	15.01G
Ours	6.55G	4.51G	6.58G	8.53G

Equations

[x_{i1}, x_{i2}, \dots, x_{iT_{i}}]

x_{ij}

\hat{x}_{i}^{(j)} = \mathrm{Norm}(\mathrm{MSA}(x_{i}^{(j)}, x_{ij}, x_{ij}) + x_{i}^{(j)})

\hat{x}_{i} = \mathrm{Concat}(\hat{x}_{i}^{(1)}, \hat{x}_{i}^{(2)}, \dots, \hat{x}_{i}^{(T_{i})})

[s_{i1}, s_{i2}, \dots, s_{iT_{z}}]

s_{ij}

\bar{z}_{i}^{(j)} = \mathrm{MSA}(z_{i}^{(j)}, s_{ij}, s_{ij})

e_{i}^{jk} = \mathrm{Concat}(\bar{z}_{i}^{(k)}, x_{ij}), \quad k = \mathrm{Ceil}\left[\frac{j \times T_{z}}{T_{i}}\right]

\hat{x}_{i}^{(j)} = \mathrm{Norm}(\mathrm{MSA}(x_{i}^{(j)}, e_{i}^{jk}, e_{i}^{jk}) + x_{i}^{(j)})

\bar{x}_{i} = \mathrm{Norm}(\mathrm{FFN}(\hat{x}_{i}) + \hat{x}_{i})

x_{i+1}

z_{i+1}

\mathrm{CCE} = -\frac{1}{S} \sum_{s=1}^{S} \sum_{c=1}^{C} y_{sc} \log_{2}(\hat{y}_{sc})

\Omega(\mathrm{MSA}) = 4Td^{2} + 2T^{2}d

\Omega(\mathrm{S\text{-}MSA}) = 4(T + T_{z})d^{2} + 2T(T_{w} + 2)d

WA

UA

WF1

MF1
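
The two complexity expressions in this list, Ω(MSA) and Ω(S-MSA), can be compared numerically. The sketch below simply evaluates both formulas as written; the symbols T, d, T_w and T_z follow the formulas, while the example values (T = 1000, d = 512, T_w = 5, T_z = 10) are hypothetical and chosen only to illustrate the gap between the quadratic and the windowed term.

```python
def msa_cost(T: int, d: int) -> int:
    """Omega(MSA) = 4*T*d^2 + 2*T^2*d (full self-attention)."""
    return 4 * T * d**2 + 2 * T**2 * d


def s_msa_cost(T: int, d: int, T_w: int, T_z: int) -> int:
    """Omega(S-MSA) = 4*(T + T_z)*d^2 + 2*T*(T_w + 2)*d (windowed attention)."""
    return 4 * (T + T_z) * d**2 + 2 * T * (T_w + 2) * d


# Hypothetical sizes: 1000 tokens, 512-dim embeddings,
# a 5-token window and 10 word tokens.
full = msa_cost(T=1000, d=512)
windowed = s_msa_cost(T=1000, d=512, T_w=5, T_z=10)

print(f"full: {full:,}  windowed: {windowed:,}  ratio: {windowed / full:.3f}")
```

The second term is what changes: it grows with T^2 for full attention but only with T times (T_w + 2) for the windowed variant, so the saving increases with sequence length.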


Code & Models

Repositories

happycolor/speechformer2
PyTorch · Official


Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Label Smoothing · Softmax · Adam · Layer Normalization · Residual Connection · Dense Connections

Full text

SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing

Weidong Chen, Xiaofen Xing, Xiangmin Xu, Jianxin Pang, and Lan Du

This work was supported in part by the National Key R&D Program of China under Grant 2022YFB4500600; in part by the National Natural Science Foundation of China under Grant U1801262; in part by the Science and Technology Project of Guangzhou under Grant 202103010002; in part by the Science and Technology Project of Guangdong under Grant 2022B0101010003; in part by the Natural Science Foundation of Guangdong Province under Grant 2022A1515011588; and in part by the Guangdong Provincial Key Laboratory of Human Digital Twin under Grant 2022B1212010004. *(Corresponding authors: Xiangmin Xu; Xiaofen Xing.)*

Weidong Chen and Xiaofen Xing are with the School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, China (e-mail: [email protected]; [email protected]). Xiangmin Xu is with the School of Future Technology, South China University of Technology, Guangzhou 511442, China, and also with Pazhou Laboratory, Guangzhou 510330, China (e-mail: [email protected]). Jianxin Pang is with UBTECH Research, UBTECH Robotics Corporation, Shenzhen 518055, China (e-mail: [email protected]). Lan Du is with iFLYTEK Research, iFLYTEK Corporation, Hefei 230088, China (e-mail: [email protected]).

Abstract

Paralinguistic speech processing is important in addressing many issues, such as sentiment and neurocognitive disorder analyses. Recently, Transformer has achieved remarkable success in the natural language processing field and has demonstrated its adaptation to speech. However, previous works on Transformer in the speech field have not incorporated the properties of speech, leaving the full potential of Transformer unexplored. In this paper, we consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing. More concretely, following the component relationship in the speech signal, we design a unit encoder to model the intra- and inter-unit information (i.e., frames, phones, and words) efficiently. According to the hierarchical relationship, we utilize merging blocks to generate features at different granularities, which is consistent with the structural pattern in the speech signal. Moreover, a word encoder is introduced to integrate word-grained features into each unit encoder, which effectively balances fine-grained and coarse-grained information. SpeechFormer++ is evaluated on the speech emotion recognition (IEMOCAP & MELD), depression classification (DAIC-WOZ) and Alzheimer’s disease detection (Pitt) tasks. The results show that SpeechFormer++ outperforms the standard Transformer while greatly reducing the computational cost. Furthermore, it delivers superior results compared to the state-of-the-art approaches.

Index Terms:

Transformer, paralinguistic speech processing, speech emotion recognition, neurocognitive disorder detection.

I Introduction

Speech has been used for thousands of years in human society and can convey the most information in the simplest way [1]. Paralinguistic speech processing (PSP), which aims to extract information beyond the linguistic content of speech, such as sentiment, depression and neurocognition, has a wide range of applications in different areas. Consequently, it is of increasing interest to the research community.

Modeling speech signals in PSP is a great challenge because the pronunciation information and the dynamic changes of speech are well understood by humans but are difficult for models to comprehend. Over the last three decades, numerous machine learning algorithms, such as hidden Markov models [2, 3, 4], decision trees [5, 6] and restricted Boltzmann machines [7, 8, 9], have been proposed to capture paralinguistic information in speech. Recently, deep learning methods have delivered superior performance for PSP tasks owing to their remarkable modeling capabilities. For example, convolutional neural networks (CNNs) [10, 11, 12, 13, 14, 15, 16], graph neural networks (GNNs) [17, 18], recurrent neural networks (RNNs) [19, 20, 21] and two popular RNN variants, namely long short-term memory (LSTM) [22, 23, 24] and gated recurrent units (GRUs) [25], have achieved promising results in the PSP domain.

The Transformer [26] is a recent deep learning architecture that was originally developed for natural language processing (NLP). It has achieved great success by using full attention to model the long-range dependencies in a sequence. Currently, a growing body of literature recognizes the value of the Transformer, especially in computer vision (CV) [27, 28, 29, 30, 31, 32]. When researchers adapt the Transformer from language to vision, the characteristics of the input signal (i.e., the image) are considered and used as a guide to modify the attention mechanism. In essence, the information contained in a language signal is dense, while that in an image is sparse. Hence, full attention in the standard Transformer is no longer appropriate. As shown in Fig. 1(a) and 1(b), to bridge the gap between language and vision, researchers in vision prefer to use local windows to capture the local information in images. However, using local windows may cause the same object in the image to be separated. To allow connections between different windows, prior works have mainly explored two directions: (1) shifting the attention windows between layers such that neighboring windows are linked together [28], and (2) clustering the input tokens and performing local attention in feature space [29, 30]. For the PSP task, there has also been extensive research on the use of the Transformer [33, 34, 35, 36, 37, 38]. Despite the improvements made, little attention has been paid to the characteristics of the input signal (i.e., speech), as has been done in computer vision, even though these characteristics are crucial for extracting paralinguistic information. Therefore, there is an urgent need to exploit the potential of the Transformer in PSP by incorporating the properties of speech signals. Meanwhile, the standard Transformer is computationally expensive due to its full attention, which makes it difficult to use in practice.

To address these drawbacks, we should rethink the natural structure of speech signals first. As shown in Fig. 1(c), we note that a speech signal can be perceived from different perspectives, allowing the information to have different granularities. We refer to this property as the hierarchical relationships of speech. The basic acoustic units that construct the speech signal, from fine to coarse, are frames, phones and words. On the other hand, a word consists of several consecutive phones, and a phone is composed of several consecutive frames. This connection between frames, phones and words is termed the component relationship. Armed with these implicit relationships, we propose a novel attention mechanism for the PSP task, which is illustrated in Fig. 1(c). First, we perform, with high computational efficiency, local attention to model the adjacent tokens that belong to the same unit. Statistical durations of phones and words are used to ensure that each local window completely covers the central unit, allowing the boundary information of the central unit to be completely preserved and not separated. The interactions between neighboring units are also considered to comprehensively simulate the inter-unit information. After sufficient learning has been performed in the current stage, we utilize a merging block to aggregate the feature tokens and feed them into the next stage, thus enabling our framework to follow the hierarchical structure of the speech signal. Moreover, to further enrich the information contained in each token, we introduce several learnable word tokens to appropriately incorporate the coarse-grained features into the fine-grained features. The contributions of this paper can be summarized as follows:

• Based on the component relationship in speech, we propose a unit encoder to capture the intra- and inter-unit information efficiently. To further enhance the extracted features, we utilize a word encoder to effectively integrate the coarse- and fine-grained information.

• Based on the hierarchical relationship in speech, we construct a hierarchical backbone, called SpeechFormer++, for paralinguistic speech processing. To the best of our knowledge, this is the first study that leverages the characteristics of speech to exploit the potential of Transformer.

• We evaluate our method on four benchmark datasets and demonstrate that our SpeechFormer++ substantially outperforms the standard Transformer in terms of performance and computational efficiency. Moreover, SpeechFormer++ achieves superior results compared to the state-of-the-art approaches. Our code is publicly available at https://github.com/HappyColor/SpeechFormer2.

A preliminary version of this work was published in [39]. We have extended our conference version as follows. In terms of the model structure, (1) we introduce an additional encoder to balance the fine-grained and the coarse-grained information efficiently and achieve superior performance; (2) we investigate the sensitivity of SpeechFormer++ to the statistical durations and offer guidance on applying SpeechFormer++ to various scenarios. In terms of verifying the effectiveness of the framework, (1) we report results of SpeechFormer++ with pretrained features and hand-crafted features to demonstrate the adaptability of our approach; (2) we compare SpeechFormer++ and other approaches using the same set of features to rule out the impact of input features; (3) we compare the performance of finetuning the pretrained model directly with several dense layers against learning further deep representations with SpeechFormer++ to demonstrate the importance of our framework; (4) we adopt attention mechanisms from computer vision to demonstrate the necessity of considering the characteristics of speech when modeling speech signals. In terms of verifying the effectiveness of every module, we conduct a comprehensive ablation study to analyze the indispensability of each module in SpeechFormer++. In terms of model understanding and interpretation, (1) we visualize the attention weights of Transformer and SpeechFormer++ to determine the reasons behind the improvements; (2) we add more evaluation metrics to each dataset for better justification of results.

The rest of this paper is organized as follows: In Section II, we provide a brief literature review on Transformer and structure-based paralinguistic speech processing. In Section III, we elaborate on the proposed SpeechFormer++ framework. In Section IV, we describe the experimental corpora and setup in detail. In Section V, we present our experimental results and analyses. Finally, we draw our conclusions in Section VI.

II Related Work

In this section, we systematically introduce the applications of Transformer in different fields and the related research on structure-based paralinguistic speech modeling.

II-A Transformer in Language Processing and Computer Vision

The original Transformer is designed to tackle machine translation tasks in the natural language processing field [26]. As a sequence learning model, Transformer is excellent at modeling the long-range dependencies, while being purely based on the attention mechanism, it dispenses with the recursion and convolution and computes hidden representations in parallel. In general, the raw text signal is first converted into a word embedding sequence via the word and position embedding layers. Then, the output is delivered to a stack of Transformer encoders to produce the final embedding, followed by several Transformer decoders or a task-specific classifier. The Transformer has been applied to various NLP tasks, including question answering [40], named entity recognition [41], natural language inference [42], semantic textual similarity [43] and document classification [44].

In computer vision, an image is first split into several fixed-size (e.g. $16\times 16$ ) patches, followed by a linear projection and a positional embedding layer to yield the input for the Transformer [27]. Inspired by CNNs that can be improved by stacking more convolutional layers, researchers have attempted to increase the depth of the vision Transformer and solve the attention collapse issue encountered when the model goes deeper [45]. Nevertheless, the self-attention operation scales quadratically with the sequence length, making the Transformer computationally expensive and unable to handle the numerous tokens in high-resolution images. Consequently, the bulk of the literature has been devoted to enhancing the attention mechanism to exploit the potential of the Transformer in computer vision [28, 29, 30, 31, 32]. For example, Rao et al. [32] evaluated the importance of each token and dropped the useless tokens dynamically and progressively. Liu et al. [28] proposed a hierarchical Transformer and performed attention within each shifted window, which greatly reduced the computational cost while also allowing for cross-window interactions. Although this method boosts efficiency, it fails to capture the relationships between distant but similar patches in the image due to the constraint of the window size. To address this limitation, the authors of [29] first clustered the tokens and then computed self-attention among the related tokens in feature space.

II-B Transformer in Paralinguistic Speech Processing

There has also been much interest in applying the standard Transformer in the speech domain. Generally, the given raw speech signal is segmented into multiple overlapping frames. Then, the spectral or deep learning-based features are extracted from each frame and used as input for the Transformer [36, 35, 34, 46]. In [35], multiple stacked Transformer layers were explored to enhance the features extracted for speech emotion recognition. In [36], researchers followed the structure of Swin Transformer [28] and cut the spectrogram into different patch tokens for sound classification and detection. Although these works demonstrate their effectiveness, they mainly adopt the Transformer directly, ignoring the characteristics of the speech and the task. To overcome this problem, efficient emotion recognition was implemented in [34] by utilizing a sparse Transformer to focus more on the emotion-related information. In [47], an auditory saliency mechanism was studied and applied in a Transformer to adjust the feature embeddings. Additionally, the Transformer has been employed for Alzheimer’s disease (AD) detection [48, 49] and depression classification [50]. For example, Ilias et al. [48] utilized a pretrained vision Transformer [27] to extract acoustic features and achieved remarkable results for dementia detection. Later, Zhu et al. [49] sought an effective integration of semantic and non-semantic speech information. Both [48] and [49] encouraged the use of pretrained models. In [50], a Transformer-based network was utilized to extract long-term temporal context information for depression estimation. Based on the Transformer, various self-supervised speech representation learning approaches have also been proposed, including wav2vec [51], wav2vec 2.0 [52] and HuBERT [53]. Built on the pretrained self-supervised models, several studies have delivered promising results [54, 33, 55, 34, 49, 39, 56, 57]. For example, Monica et al. [33] fine-tuned the pretrained HuBERT model for AD detection and achieved competitive performance. In addition to the paralinguistic tasks, the Transformer is also broadly used in the speech recognition field [58, 59, 60]. For instance, Wang et al. [59] explored the potential of Transformer-based acoustic models on hybrid speech recognition and achieved significant word error rate improvement over the conventional baselines. Gulati et al. [60] proposed a convolution-augmented Transformer, called Conformer, to learn both global interactions and local features effectively.

II-C Structure-Based Paralinguistic Speech Modeling

The speech signal is structured by different basic units, from fine to coarse, which are frames, phones and words. This natural structure is unique to speech and contains extensive paralinguistic information such as fluency, articulation, prolongation and rhythm. Some previous works have taken advantage of these speech structures to improve system performance [61, 62, 63, 64, 39, 65]. For example, Zhao et al. [61] trained a hierarchical network for depression severity measurement, where frame-level and sentence-level representations were learned explicitly. However, human speech has more than these two levels. Other levels such as the phone level and the word level can better reflect the pronunciation. To address this problem, many works have explored hierarchical multi-granularity learning. For example, [62] utilized a hierarchical attention structure with word-level alignment for emotion recognition. In [63], phone- and word-level representations were captured through a GRU network [66], using the ground-truth timestamps of every unit. The work in [64] aggregated the acoustic embedding for each word based on its corresponding speech frames for information fusion. Although the above methods improve the recognition performance, the requirement of exact timestamps of phones or words in their systems makes them unsuitable for practical applications. Additionally, current studies have not sufficiently explored the hierarchy of the speech signal, leaving the full potential of the Transformer in the speech domain unexplored. Recently, researchers demonstrated that deep neural networks can integrate the individual basic units across multiple timescales via different integration windows, the sizes of which were yoked to the duration of the units [67]. This finding also indicates that the basic units in speech guide model learning. However, existing works on the Transformer have not comprehensively considered the audio properties, which is remedied in this paper.

III Methodology

The proposed framework, as shown in Fig. 2, consists of four stages and three key modules. The unit encoder and word encoder are used for structure-based speech unit learning, and a merging block is employed for structure-based speech unit aggregation. We first clarify the guidelines for model design. Afterwards, we elaborate on the proposed SpeechFormer++.

III-A Guiding Principles of Model Design

The statistical duration of the speech unit is the basis for the design of our framework. Therefore, we first estimate the durations of phones and words on the corpora used in this paper using the P2FA [68] toolkit. Since the distribution of the unit duration is similar for each corpus, we illustrate in Fig. 3 the statistical results obtained by combining all the audio files from the four corpora. We note that more than 80% of phones last between 50 and 200 ms, and we therefore approximate the shortest and longest durations of phones to be 50 and 200 ms, respectively. Similarly, almost 90% of words last between 250 and 1000 ms, which we regard as the shortest and longest durations of words. Additionally, we note that the duration of a frame is simply the frame length used when extracting acoustic features, which can be set manually. In addition, the hierarchical pattern in the speech signal aggregates the consecutive units progressively, which sheds new light on the design of the hierarchical framework.

III-B Structure-Based Speech Unit Learning

III-B1 Unit Encoder

Given a speech signal, we first extract its acoustic representations $x_{1}\in\mathbb{R}^{T_{1}\times d_{1}}$ , where $T_{1}$ is the number of frames and $d_{1}$ is the dimension of each frame embedding. To capture the information about consecutive frames in the frame stage, we employ a unit encoder with window $T_{w1}$ to learn the frame-grained features in $x_{1}$ . Specifically, the frame-grained input feature $x_{1}$ is split into $T_{1}$ overlapping segments:

[TABLE]

where subscript $i$ denotes the different stages in Fig. 2 (e.g., $i$ = 1 for the frame stage, $i$ = 2 for the phone stage, $i$ = 3 for the word stage and $i$ = 4 for the utterance stage); $OverlapSeg(\cdot)$ represents the overlapping segmentation, and $j\in[1,T_{i}]$ ; $x_{i}[a:b]\in\mathbb{R}^{(b-a)\times d_{i}}$ consists of the $a$ -th to the $b$ -th tokens of $x_{i}$ . Here $i$ = 1 because we are in the frame stage. Zero padding is employed when the segment is out of range (e.g., when $a<0$ or $b>T_{i}$ ). The value of $T_{w1}$ is set to the number of tokens that can be contained within 50 ms (the shortest duration of phones) of input $x_{1}$ . Thus, the interactions of nearby frames are learnt. Specifically, the attention in each segment can be written as:

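$$\hat{x}_{i}^{(j)}=\mathrm{Norm}(\mathrm{MSA}(x_{i}^{(j)},x_{ij},x_{ij})+x_{i}^{(j)})\tag{2}$$

$$\hat{x}_{i}=\mathrm{Concat}(\hat{x}_{i}^{(1)},\hat{x}_{i}^{(2)},\dots,\hat{x}_{i}^{(T_{i})})\tag{3}$$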

where $x_{i}^{(j)}\in\mathbb{R}^{1\times d_{i}}$ is the $j$ -th token in $x_{i}$ , $j\in[1,T_{i}]$ ; $\hat{x}_{i}^{(j)}$ and $\hat{x}_{i}$ are the updated values of $x_{i}^{(j)}$ and $x_{i}$ , respectively. $MSA(Q,K,V)$ represents the Multi-Head Self-Attention (MSA) mechanism with inputs query $Q$ , key $K$ and value $V$ . More details of MSA can be found in [26]. $Norm(\cdot)$ represents the layer normalization [69] throughout the paper. When performing attention, the query $x_{i}^{(j)}$ denotes the central feature token in the current overlapping segment.
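
The overlapping segmentation and the per-segment attention described above can be sketched in NumPy. This is an illustration under stated assumptions, not the authors' implementation: it uses a single attention head with no learned query, key or value projections, centres each window on token $j$ (for an even window one extra token falls on the left), masks zero-padded positions out of the softmax, and applies layer normalization without learnable scale and shift. The function names `overlap_seg`, `layer_norm` and `unit_attention` are hypothetical.

```python
import numpy as np


def overlap_seg(x: np.ndarray, t_w: int):
    """Split x of shape (T, d) into T overlapping segments of t_w tokens.

    Segment j is centred on token j (an assumption of this sketch).
    Out-of-range positions are zero padded. Returns the segments with shape
    (T, t_w, d) and a boolean mask of shape (T, t_w), False on padding.
    """
    T, _ = x.shape
    left = t_w // 2
    right = t_w - 1 - left
    padded = np.pad(x, ((left, right), (0, 0)))            # zero padding
    valid = np.pad(np.ones(T, dtype=bool), (left, right))  # False on padding
    segs = np.stack([padded[j:j + t_w] for j in range(T)])
    mask = np.stack([valid[j:j + t_w] for j in range(T)])
    return segs, mask


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Layer normalization over the last axis, without learnable parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def unit_attention(x: np.ndarray, t_w: int) -> np.ndarray:
    """Token j attends only to its own segment, then residual + Norm."""
    _, d = x.shape
    segs, mask = overlap_seg(x, t_w)
    # Query: the central token x[j]; keys and values: the tokens of segment j.
    scores = np.einsum("td,twd->tw", x, segs) / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)       # ignore padded positions
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    attended = np.einsum("tw,twd->td", weights, segs)
    return layer_norm(attended + x)                # all updated tokens, (T, d)


x_demo = np.random.default_rng(0).standard_normal((8, 4))  # T = 8, d = 4
x_hat = unit_attention(x_demo, t_w=3)
print(x_hat.shape)
```

Because every query sees only `t_w` keys, the score matrix has shape (T, t_w) rather than (T, T), which is where the saving over full attention comes from.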

In the phone stage, we assume the phone-grained input feature to be $x_{2}\in\mathbb{R}^{T_{2}\times d_{2}}$ , where $T_{2}$ denotes the number of phone tokens and $d_{2}$ denotes the dimensions of phone embeddings. Each token contained in $x_{2}$ is the representation of a phone or subphoneme, which is produced by the merging block using the output of the frame stage (described in III-C) and fed into the phone stage. To model a phone and the interactions with its neighbors, the value of window $T_{w2}$ is set to the number of tokens that can be contained within 400 ms (twice the longest duration of phones) of $x_{2}$ . Thus, each segment covers consecutive phones, and the central phone is unbroken. Finally, the attention calculation in the phone stage follows Eqs. 1-3 with $i=2$ .

Similarly, in the word stage, its word-grained input feature is $x_{3}\in\mathbb{R}^{T_{3}\times d_{3}}$ , where $T_{3}$ and $d_{3}$ represent the number of word tokens and the dimensions of word embeddings, respectively. It is produced by a merging block using the output of the phone stage (described in III-C). To capture the intra- and inter-word information, the window size $T_{w3}$ in the word stage is set to the number of tokens that can be contained within 2000 ms (twice the longest duration of words) of $x_{3}$ . The attention mechanism is then invoked in the overlapping segments, each of which contains a central word and its surrounding context. The computational process follows Eqs. 1-3 with $i=3$ .
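To make the window sizes concrete, the following sketch converts the statistical durations into token counts. The helper `tokens_within` is hypothetical, rounding up is an assumption, and the token spans assume a 20 ms frame hop (as reported for HuBERT in Section IV-B1) and nominal spans of 50 ms and 250 ms for phone-stage and word-stage tokens, following the merging scales described in III-C.

```python
import math


def tokens_within(duration_ms, token_ms):
    """Number of tokens needed to cover duration_ms when each token spans token_ms.

    Rounding up is an assumption of this sketch.
    """
    return math.ceil(duration_ms / token_ms)


FRAME_HOP_MS = 20     # hop length of the acoustic frames (assumed from Section IV-B1)
PHONE_TOKEN_MS = 50   # nominal span of one phone-stage token (merging scale M_1)
WORD_TOKEN_MS = 250   # nominal span of one word-stage token (merging scale M_2)

T_w1 = tokens_within(50, FRAME_HOP_MS)      # frame stage: shortest phone duration
T_w2 = tokens_within(400, PHONE_TOKEN_MS)   # phone stage: twice the longest phone
T_w3 = tokens_within(2000, WORD_TOKEN_MS)   # word stage: twice the longest word
```

Under these assumptions the windows contain 3, 8 and 8 tokens, respectively, so each stage attends over a small, fixed number of tokens regardless of the utterance length.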

III-B2 Word Encoder

The proposed unit encoder is able to model the fine-grained features efficiently. However, its receptive field is limited by the size of the attention window. To take the coarse-grained information into account, we propose a word encoder (Fig. 4(c)) to inject the coarse-grained information into each unit encoder. We first create several learnable word tokens $z_{1}\in\mathbb{R}^{T_{z}\times d_{1}}$ for the frame stage, where $T_{z}$ indicates the approximate number of words in the utterance. Concretely, the value of $T_{z}$ is equal to the total duration of the utterance divided by 1000 ms (the longest duration of words). $z_{2}\in\mathbb{R}^{T_{z}\times d_{2}}$ for the phone stage and $z_{3}\in\mathbb{R}^{T_{z}\times d_{3}}$ for the word stage are produced by the merging block (described in III-C). Then, the input $x_{i}$ is evenly grouped into $T_{z}$ non-overlapping segments. Each learnable word token is required to learn the coarse-grained features about the corresponding segment. The learning process is as follows:

[TABLE]

where $EvenSeg(\cdot)$ denotes the non-overlapping segmentation, $s_{ij}$ is the $j$ -th non-overlapping segment of $x_{i}$ and $j\in[1,T_{z}]$ , $z_{i}^{(j)}$ denotes the $j$ -th learnable word token in $z_{i}$ and $\bar{z}_{i}^{(j)}$ is the updated value of $z_{i}^{(j)}$ . Since the interactions between words are modeled by the word stage of SpeechFormer++, we perform non-overlapping segmentation in the word encoder. Note that the number of non-overlapping segments is always identical to that of the learnable word tokens and remains constant across different stages.
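The non-overlapping segmentation and the word-token update can be sketched as follows. This is an illustrative simplification rather than the authors' code: the names `num_word_tokens`, `even_seg` and `update_word_token` are hypothetical, rounding $T_{z}$ up is an assumption, `np.array_split` is used as one plausible way to obtain nearly equal segments, and the attention omits learned projections, multiple heads and normalization.

```python
import math

import numpy as np


def num_word_tokens(utterance_ms, longest_word_ms=1000):
    """Approximate number of words T_z: utterance duration / longest word duration."""
    return max(1, math.ceil(utterance_ms / longest_word_ms))


def even_seg(x, t_z):
    """Split x of shape (T, d) into t_z contiguous, nearly equal segments."""
    return np.array_split(x, t_z, axis=0)


def update_word_token(z_j, segment):
    """Let one word token (shape (d,)) attend over its own segment (shape (n, d))."""
    d = z_j.shape[-1]
    scores = segment @ z_j / np.sqrt(d)          # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over the segment
    return weights @ segment                      # (d,)
```

For example, a 326-frame input with a 20 ms hop lasts about 6.5 s, which gives seven word tokens, each summarizing 46 or 47 consecutive frames.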

Then, we pass the learnt $\bar{z}_{i}$ to each unit encoder in the $i$ -th stage, allowing the unit encoders to take the coarse-grained information into consideration while modeling locally. As shown in Fig. 4(b), each acoustic segment is enhanced by its corresponding learnable word token, which is then fed into the MSA and FFN layers. The complete calculation flow in the unit encoder (Fig. 4(b)) is as follows:

[TABLE]

where $e_{i}^{jk}\in\mathbb{R}^{(1+T_{wi})\times d_{i}}$ is the enhanced segment, $Ceil[\cdot]$ rounds a number upward to its nearest integer and $j\in[1,T_{i}]$ ; $FFN(\cdot)$ denotes the feed-forward network and $\bar{x}_{i}$ denotes the final output of the unit encoder. The parameters in the MSA are shared between the unit encoder and the word encoder, keeping the size of the model unchanged. In addition, a unit encoder and a word encoder constitute a basic SpeechFormer++ block. Multiple SpeechFormer++ blocks are stacked to form a stage in our proposed framework.

III-C Structure-Based Speech Unit Aggregation

III-C1 Merging Block

Inspired by the hierarchical property of speech signals that can be progressively categorized into frames, phones and words, we propose a merging block to generate the relevant features under the instruction of the statistical durations of the speech units. As shown in Fig. 2, merging blocks are used between consecutive stages. Initially, the acoustic input of the frame stage $x_{1}$ represents the features of each frame from the original speech signal. To provide the phone-grained input to the phone stage, we apply average pooling over the output of the frame stage $\bar{x}_{1}$ with a merging scale $M_{1}$ of 50 ms (the shortest duration of phones). Then, a linear projection and layer normalization are performed to create the phone-grained feature $x_{2}$ . The information contained in every 50 ms is aggregated into a token in $x_{2}$ such that each token in $x_{2}$ represents the information of a subphoneme. Analogously, the merging scale $M_{2}$ is set to 250 ms (the shortest duration of words) when generating the word-grained input $x_{3}$ for the word stage, making each token in $x_{3}$ a representation of a subword. Finally, the last merging block is applied to the output of the word stage $\bar{x}_{3}$ with the merging scale $M_{3}$ set to 1000 ms (the longest duration of words) to roughly approximate the number of words in the utterance sample. The learnable word tokens $z_{i}$ represent the coarse-grained features in words, and thus, we do not have to aggregate them. Formally, the merging block is defined as:

[TABLE]

where $AvgPool(x,M)$ represents an average pooling layer performed on $x$ with window size and stride equal to $M$ ; $W_{i}\in\mathbb{R}^{d_{i}\times d_{i+1}}$ and $b\in\mathbb{R}^{d_{i+1}}$ are learnable parameters; $\bar{x}_{i}$ and $\bar{z}_{i}$ denote the outputs of the $i$ -th stage and $x_{i+1}$ and $z_{i+1}$ denote the inputs of the next stage, $i\in\{1,2,3\}$ .
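A minimal NumPy sketch of the merging operation on the acoustic tokens, following the order stated in the text (average pooling, then linear projection, then layer normalization). The function names are hypothetical, the merging scale `m` is expressed in tokens rather than milliseconds, zero-padding the tail of the sequence is an assumption, and the layer normalization here has no learnable gain or bias.

```python
import numpy as np


def avg_pool(x, m):
    """Average pooling over time with window size and stride both equal to m.

    x has shape (T, d). The tail is zero-padded to a multiple of m, which is
    an assumption of this sketch.
    """
    T, d = x.shape
    pad = (-T) % m
    if pad:
        x = np.pad(x, ((0, pad), (0, 0)))
    return x.reshape(-1, m, d).mean(axis=1)


def layer_norm(x, eps=1e-5):
    """Layer normalization over the feature dimension (no learnable parameters)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def merging_block(x_bar, m, W, b):
    """Aggregate every m tokens, project from d_i to d_{i+1}, then normalize."""
    return layer_norm(avg_pool(x_bar, m) @ W + b)
```

Each merging block shortens the sequence by a factor of roughly `m`, so later stages operate on far fewer tokens than the frame stage.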

The outputs of the third merging block are concatenated together and fed into the utterance stage, which is a stack of standard Transformer encoders, to model the speech signal globally. The overview of the computational flow in our proposed method is illustrated in Fig. 5. The acoustic tokens are aggregated progressively to imitate the structural pattern in the speech signal, and the attention is guided by the characteristics of speech. The final output of the utterance stage is pooled in the temporal dimension and is passed to a classifier, which is composed of two linear projections with an activation function in between, to yield the final classification results.

III-D Loss Function

We choose the categorical cross-entropy loss (CCE) as the objective function in this paper. Suppose we have $S$ samples and $C$ possible categories. The CCE can be represented as:

[TABLE]

where $\hat{y}_{sc}\in\mathbb{R}^{1}$ denotes the predicted probability that the $s$ -th sample belongs to class $c$ and $y_{sc}\in\mathbb{R}^{1}$ is 1 when c is equal to the ground-truth label and 0 otherwise.
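For illustration, the categorical cross-entropy can be computed as in the sketch below. Averaging over the $S$ samples and clipping the probabilities for numerical stability are assumptions of this sketch; the function name is hypothetical.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Categorical cross-entropy for one-hot labels and predicted probabilities.

    y_true, y_prob: arrays of shape (S, C). The loss is averaged over the
    S samples (an assumption of this sketch).
    """
    y_prob = np.clip(y_prob, eps, 1.0)   # avoid log(0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())
```

With uniform predictions over two classes, for instance, the loss equals $\ln 2$ regardless of the labels.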

III-E Complexity Analysis

Suppose the inputs are $x\in\mathbb{R}^{T\times d}$ and $z\in\mathbb{R}^{T_{z}\times d}$ , and the window size is $T_{w}$ . The computational complexities of the MSA in Transformer and SpeechFormer++ (S-MSA) are:

[TABLE]

Note that we omit the softmax computation in determining complexity. When $T_{w}$ and $T_{z}$ are fixed, $\Omega(S\mbox{-}MSA)$ scales linearly with the sequence length $T$ , while $\Omega(MSA)$ in the standard Transformer scales quadratically. Moreover, the values of $T_{w}$ and $T_{z}$ are much smaller than that of $T$ in practice. When features go through a merging block, the number of tokens is greatly reduced, making the computational cost of the later layers in SpeechFormer++ fairly low. The cost of the merging block is negligible compared to the total complexity and the model size.
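The linear versus quadratic scaling can be checked with a rough operation count. The sketch below is not the complexity expression of the paper: it counts only the multiply-accumulates of the score computation and the weighted sum, and it ignores the linear projections and the word-token terms. Both function names are hypothetical.

```python
def full_attention_cost(T, d):
    """Score and weighted-sum cost when every token attends to all T tokens."""
    return 2 * T * T * d


def windowed_attention_cost(T, d, T_w):
    """Score and weighted-sum cost when every token attends to T_w tokens only."""
    return 2 * T * T_w * d


# Doubling the sequence length quadruples the full-attention cost
# but only doubles the windowed cost.
ratio_full = full_attention_cost(800, 1024) / full_attention_cost(400, 1024)
ratio_windowed = (windowed_attention_cost(800, 1024, 8)
                  / windowed_attention_cost(400, 1024, 8))
```

Here `ratio_full` is 4 and `ratio_windowed` is 2, which matches the qualitative statement that the windowed attention scales linearly with $T$ when $T_{w}$ is fixed.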

IV Experimental Setup

IV-A Datasets and Evaluation Metric

IEMOCAP [70] is the most commonly used dataset in the speech emotion recognition field. It contains 12 hours of audio data and consists of five sessions, each of which has one male speaker and one female speaker. 5,531 utterances from four emotion categories (angry, neutral, happy and sad) are considered in this work; we merge the excited samples with the happy samples in IEMOCAP. To train and test the model, we conduct experiments using the leave-one-session-out cross-validation strategy. Specifically, samples from 4 sessions are used for training, and the remaining session is regarded as the testing set, which is repeated 5 times until every session has been used for testing. We evaluate the model on the testing set at each epoch, and the reported results are the average scores of the 5-fold experiments.

MELD [71] is the second dataset we used for emotion recognition. The dataset contains 13,708 utterances from the Friends TV series, divided into 7 emotion classes: anger, disgust, sadness, joy, neutral, surprise and fear. Since this dataset has been officially divided into training, validation and testing sets, we use the validation set for hyperparameter tuning. The model with the best performance on the validation set across epochs is evaluated on the testing set. Finally, the results on the testing set are reported.

Pitt [72] is a classical dataset used in the AD detection field. To produce the narrative speech recordings, the AD patients and healthy controls are asked to take the “Cookie Theft” picture description task from the Boston Diagnostic Aphasia Examination [73]. To evaluate the model on the Pitt dataset, the speaker-independent 10-fold cross-validation technique is implemented. Similar to IEMOCAP, we evaluate the model on the testing set at each epoch, and the reported results are the average scores of the 10-fold experiments.

DAIC-WOZ [74], used in AVEC 2017 [75], is a subset of the Distress Analysis Interview Corpus (DAIC) [74]. This dataset originally contains training, validation and testing sets, and a depressed/not depressed label is assigned to each clinical interview recording in the training and validation sets, but the labels of the test data are not provided. Therefore, we randomly select 20% of the training data for hyperparameter tuning and checkpoint selection. Finally, the results on its original validation set are reported.

Following previous works [37, 33], we apply four evaluation metrics to evaluate the performance of different learning algorithms: weighted accuracy (WA), unweighted accuracy (UA), weighted average F1 (WF1) and macro average F1 (MF1). The above criteria can be formulated as:

[TABLE]

where $S_{c}$ denotes the number of samples of the $c$ -th category and $Acc(c)$ and $F1(c)$ are the classification accuracy and F1 score of the $c$ -th category, respectively.
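The four metrics can be computed from predicted and ground-truth labels as in the sketch below, under the usual reading of the definitions above: WA and WF1 weight each class by its number of samples $S_{c}$, while UA and MF1 average uniformly over classes. The function name is hypothetical, and treating classes without samples as contributing zero is an assumption.

```python
import numpy as np


def paralinguistic_metrics(y_true, y_pred, num_classes):
    """Return WA, UA, WF1 and MF1 for integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    support = np.zeros(num_classes)   # S_c
    acc = np.zeros(num_classes)       # Acc(c), the per-class accuracy (recall)
    f1 = np.zeros(num_classes)        # F1(c)
    for c in range(num_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        fp = np.sum((y_true != c) & (y_pred == c))
        support[c] = tp + fn
        acc[c] = tp / support[c] if support[c] else 0.0
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    total = support.sum()
    return {
        "WA": float((support * acc).sum() / total),
        "UA": float(acc.mean()),
        "WF1": float((support * f1).sum() / total),
        "MF1": float(f1.mean()),
    }
```

On imbalanced data, WA is dominated by the majority class, whereas UA and MF1 give each class equal weight, which is why both kinds of scores are reported.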

For speech emotion recognition on IEMOCAP and MELD, we aim to predict the discrete emotion labels for each individual utterance. While conducting neurocognitive disorder analyses (namely, Alzheimer’s disease detection on Pitt and depression classification on DAIC-WOZ), we first receive a dialogue, and we then crop out the utterances of the participant based on the provided transcription timestamps. Subsequently, the utterances are processed and predicted, and a majority vote is applied to yield a subject-level prediction, which is used for final evaluation.
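The subject-level aggregation can be sketched as a simple majority vote over utterance-level predictions. The function name is hypothetical, and the tie-breaking rule (the smallest label wins) is an assumption, since the text does not specify one.

```python
from collections import Counter


def subject_level_prediction(utterance_preds):
    """Majority vote over the utterance-level predictions of one subject.

    Ties are broken in favour of the smallest label (an assumption).
    """
    counts = Counter(utterance_preds)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)
```

For a subject whose utterances are predicted as `[1, 0, 1]`, the subject-level prediction is class 1.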

IV-B Implementation Details

IV-B1 Acoustic Features

Encouraged by the success of self-supervised learning models in various speech tasks, we utilize the pretrained HuBERT-large [53] model to extract the acoustic features. Specifically, the duration of each frame processed in HuBERT is 25 ms, and the hop length used when yielding the overlapping frames is 20 ms. The overlap between consecutive frames is 5 ms. In total, 1024-dimensional frame-grained features are extracted for each utterance sample. Recently, it has been reported that the output from the middle layer has the most pronunciation-related features [76]. Hence, we use the output from the 12-th layer of the 24-layer Transformer encoder in HuBERT. Unless otherwise stated, the pretrained self-supervised models are only used to extract the acoustic features and will not be involved in the training procedure. The max sequence lengths are set to 326, 224, 328 and 426 for IEMOCAP, MELD, Pitt and DAIC-WOZ, respectively, because 80% of samples in each dataset are shorter than the corresponding sequence length. We also report the results of SpeechFormer++ with hand-crafted features, such as 80-dimensional log-mel filter bank coefficients (FBANK). Unless otherwise stated, HuBERT features are used in SpeechFormer++.

IV-B2 Training Details

We train SpeechFormer++ in an end-to-end manner using an NVIDIA GeForce RTX 2080 Ti GPU. The total numbers of training epochs are set to 120, 120, 80 and 60 for IEMOCAP, MELD, Pitt and DAIC-WOZ, respectively, and their initial learning rates are set to 0.0005, 0.0005, 0.001 and 0.0001, respectively. The learning rate gradually drops to 1% of its initial value by cosine annealing. The batch size is set to 32. The model is updated by SGD with momentum 0.9. The number of attention heads used in MSA is set to 8. For the sake of simplicity, the dimensions of $x_{i}$ and $z_{i}$ , $i\in\{1,2,3,4\}$ , are all set to 1024. Unless otherwise stated, the numbers of layers employed in the frame stage $N_{1}$ , phone stage $N_{2}$ and word stage $N_{3}$ are 2, 2 and 4, respectively, and the number of Transformer encoders used in the utterance stage $N_{4}$ is 4. As a result, the total number of layers of SpeechFormer++ is 12.
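The learning-rate schedule can be illustrated with the standard cosine annealing formula. The sketch assumes the rate is updated once per epoch and reaches exactly 1% of its initial value at the last epoch; the function name is hypothetical and the actual training code may use a framework scheduler instead.

```python
import math


def cosine_annealed_lr(epoch, total_epochs, lr_init, final_ratio=0.01):
    """Cosine annealing from lr_init down to final_ratio * lr_init."""
    lr_min = lr_init * final_ratio
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_init - lr_min) * cosine


# Example with the IEMOCAP setting: 120 epochs, initial learning rate 0.0005.
lr_start = cosine_annealed_lr(0, 120, 0.0005)
lr_end = cosine_annealed_lr(120, 120, 0.0005)
```

With these settings the rate starts at 0.0005 and ends at 0.000005.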

V Results and Discussion

In this section, we report the experiments conducted on four corpora, including three recognition tasks. First, we compare the proposed SpeechFormer++ with the standard Transformer architecture in terms of performance and computational efficiency. To be consistent with the settings of SpeechFormer++, a total of 12 Transformer encoders are used in the standard Transformer framework. Second, we present a comparison with previous works. Finally, we perform extensive ablation studies to better understand the effectiveness of each module.

V-A Speech Emotion Recognition on IEMOCAP

V-A1 Comparison to Transformer

Table I presents the results of Transformer and the proposed SpeechFormer++ on IEMOCAP. Our SpeechFormer++ has a slightly larger model size due to the addition of the merging blocks. However, the theoretical computational complexity (FLOPs) of SpeechFormer++ is greatly reduced (by 71.67%) compared to Transformer. Meanwhile, our model boosts performance consistently (0.705 vs. 0.685 in WA, 0.715 vs. 0.701 in UA and 0.707 vs. 0.692 in WF1), meaning that our model is efficient and effective.

V-A2 Comparison to Previous State-of-the-Art

Table II lists the results of SpeechFormer++ and existing works on IEMOCAP. Our model with HuBERT features achieves 0.705 WA, 0.715 UA and 0.707 WF1, surpassing the previous best results. Our SpeechFormer++ with hand-crafted features outperforms STC [14] (0.645 vs. 0.613 in WA, 0.658 vs. 0.604 in UA and 0.649 vs. 0.617 in WF1) and achieves comparable results to LSTM-GIN [17] (0.645 vs. 0.647 in WA and 0.658 vs. 0.655 in UA) under the same experimental setup. SpeechFormer++ obtains inferior results compared to ISNet [15]. We suspect this is because ISNet is equipped with a carefully designed individual benchmark to alleviate the problem of interindividual emotion confusion. In addition, speaker information is used in ISNet. Our SpeechFormer++ is a general backbone and can be employed in ISNet for further improvement.

V-B Speech Emotion Recognition on MELD

V-B1 Comparison to Transformer

The results on MELD are shown in Table III. Compared to the standard Transformer, SpeechFormer++ yields a relative improvement of 5.16% in WA, 6.23% in UA and 3.52% in WF1. Although the model size of SpeechFormer++ is slightly larger, the computational effort of SpeechFormer++ is reduced from 15.33G to 4.51G, which is a 70.58% relative reduction.
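The relative figures quoted here follow from the usual definition of relative change with respect to the baseline. As a small worked check, using only the FLOPs values stated above (the helper name is hypothetical):

```python
def relative_change(new, baseline):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# FLOPs on MELD: 15.33G for Transformer, 4.51G for SpeechFormer++.
flops_reduction = -relative_change(4.51, 15.33)
```

The result is about 70.58%, in agreement with the reported reduction.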

V-B2 Comparison to Previous State-of-the-Art

Table IV compares SpeechFormer++ with previous state-of-the-art models on MELD. It can be observed that SpeechFormer++ with HuBERT features noticeably outperforms the previous works by a large margin of +3.1% WF1 and +1.2% WA. When compared under the hand-crafted features, SpeechFormer++ outperforms ConGCN [78], MMFA-RNN [19], MM-DFN [18] and CTNet [37] in terms of WF1. Note that SpeechFormer++ is simply applied in MELD and does not utilize the context and speaker information. This demonstrates the potential of SpeechFormer++ and the possibility of further improvement.

V-C Alzheimer’s Disease Detection on Pitt

V-C1 Comparison to Transformer

As shown in Table V, our SpeechFormer++ once again beats the standard Transformer framework on Pitt in terms of WA and UA while having much lower FLOPs. In detail, the results of SpeechFormer++ are +2.4 WA, +2.6 UA and +2.8 WF1 superior to the Transformer with a comparable model size (66.79M vs. 63.64M) and a significantly lower computational burden (6.58G vs. 23.28G).

V-C2 Comparison to Previous State-of-the-Art

Table VI compares SpeechFormer++ with existing works on Pitt. Our method using HuBERT features outperforms the other methods with promising gains: +7.7% WA over [10], +8.2% WA over [11], +8.8% WF1 over [79], +7.4% (+17.5%) WA (UA) over [20] and +7.3% (+7.6%) WA (UA) over [33]. Also, our method using FBANK outperforms the competitors using hand-crafted features in terms of WA and UA.

V-D Depression Classification on DAIC-WOZ

V-D1 Comparison to Transformer

Results of the standard Transformer and SpeechFormer++ on the DAIC-WOZ corpus are shown in Table VII. For Transformer, the FLOPs reach 31.26G since the audio samples in DAIC-WOZ are overall longer than those of the other three corpora. The computational effort grows rapidly as the length of the input sequence increases. Not surprisingly, our SpeechFormer++ delivers superior performance (0.771 vs. 0.686 in WA, 0.726 vs. 0.661 in UA and 0.709 vs. 0.658 in MF1) while keeping the FLOPs at a relatively low level (8.53G).

V-D2 Comparison to Previous State-of-the-Art

The comparison results of the proposed SpeechFormer++ and the previous works on DAIC-WOZ are presented in Table VIII. Our method with HuBERT features outperforms currently advanced approaches by a considerable margin in all metrics. Additionally, SpeechFormer++ using FBANK features achieves state-of-the-art compared to other hand-crafted feature-based methods, drawing the improvement of 0.3% $\sim$ 15.7% on WA, 7.4% $\sim$ 16.8% on UA and 5.3% $\sim$ 18.1% on MF1.

V-E Comparison Under the HuBERT Features

To rule out the impact of input features, we compare SpeechFormer++ with other approaches using HuBERT features. Experimental results in Table IX demonstrate that SpeechFormer++ shows superior performance on four corpora. For IEMOCAP, our method shows an absolute improvement of 2.9% $\sim$ 4.8% on WA, 4.6% $\sim$ 5.6% on UA and 4.5% $\sim$ 6.4% on WF1 over other competitors. For MELD, our method shows an absolute improvement of 1.0% $\sim$ 3.3% on WA, 1.9% $\sim$ 3.0% on UA and 1.2% $\sim$ 2.4% on WF1 over other competitors. For Pitt, our method shows an absolute improvement of 1.9% $\sim$ 7.3% on WA, 2.5% $\sim$ 7.6% on UA and 3.4% $\sim$ 6.3% on WF1 over other competitors. For DAIC-WOZ, our method shows an absolute improvement of 5.7% $\sim$ 8.5% on WA, 2.3% $\sim$ 2.5% on UA and 1.5% $\sim$ 3.3% on MF1 over other competitors. We attribute this to the fact that the other methods ignore the structural features of the speech signal, which SpeechFormer++ takes into account. These results verify the effectiveness of the proposed method.

V-F Ablation Study

In this section, we conduct a comprehensive ablation study on the four corpora to determine the role of each module in our SpeechFormer++. All ablation studies are implemented in the same configuration, except for the module under investigation. In addition, we investigate the sensitivity of SpeechFormer++ to the statistical phone and word durations. Finally, we compare SpeechFormer++ with finetuning of HuBERT to verify the importance of the downstream model.

V-F1 Effectiveness of Unit Encoder

We first replace the unit encoder with the standard Transformer encoder, which means each layer in the modified model always applies full attention among all the acoustic tokens. The merging blocks are preserved in the modified model. Since the word encoder is used to enhance the unit encoder, we also remove the word encoder when the unit encoder is disabled. The results are reported in Table X. The computational complexity is also listed for comparison. Supported by the unit encoder, SpeechFormer++ performs better than its counterpart by +2.77% (+3.03%) WA (UA) on IEMOCAP, +2.84% WF1 on MELD, +2.14% (+1.62%) WA (UA) on Pitt and +4.26% MF1 on DAIC-WOZ. Generally, it demonstrates that modeling structure-based acoustic information under the instruction of the characteristics of speech can distinctly improve the performance of recognition. In addition, the attention mechanism in the proposed unit encoder scales linearly with the sequence length, which allows our SpeechFormer++ to achieve better performance at lower computational cost.

V-F2 Effectiveness of Word Encoder

To verify the effectiveness of the word encoder, we discard all learnable word tokens in SpeechFormer++ such that the unit encoder can only perceive the local information within the window. The results are summarized in Table XI. The performance of the counterpart is weaker than SpeechFormer++ on four corpora, especially for IEMOCAP (0.696 vs. 0.705 in WA), MELD (0.464 vs. 0.470 in WF1), and DAIC-WOZ (0.672 vs. 0.709 in MF1). This indicates that the coarse-grained information cannot be neglected even if the fine-grained feature is effective. The word encoder presents an efficient way to help each unit encoder consider the coarse-grained information when modeling local segments. Note that the parameters in the word encoder and unit encoder are shared, so introducing the word encoder does not increase the model size but only the computational cost. In particular, the relative increments in FLOPs are +7.73%, +7.89%, +7.52% and +7.57% on IEMOCAP, MELD, Pitt and DAIC-WOZ, respectively.

V-F3 Effectiveness of Merging Block

To analyze the indispensability of the merging block, we implement a modified model with solely the unit encoder and the word encoder, where the number of input tokens is the same for each layer. In SpeechFormer++, each time the acoustic feature goes through a merging block, the number of tokens is greatly reduced, which significantly reduces the computational burden for the following layers. More concretely, as shown in Table XII, the computational complexity of SpeechFormer++ is reduced by 70.46% on IEMOCAP, 70.03% on MELD, 70.51% on Pitt and 70.91% on DAIC-WOZ compared to the counterpart. The model size is increased by merely 4.95% over the Transformer. Furthermore, the presence of the merging block brings +0.37% $\sim$ +5.51% relative gains over four corpora, which further confirms the necessity of the merging block.

V-F4 Sensitivity to the Statistical Durations

SpeechFormer++ is designed under the instruction of the statistical durations of speech. To investigate the sensitivity of SpeechFormer++ to the statistical phone and word durations, we intentionally set the phone and word durations longer or shorter than the statistics described in Section III-A. The experimental results are shown in Fig. 6. When the value of the x-axis $mismatch$ in Fig. 6 is larger than 1, the durations of phone and word used in the system are $mismatch$ times longer than the respective statistics. On the contrary, if the $mismatch$ is less than 1, the durations used are shorter than the statistical durations. The durations are consistent with the statistics only if the $mismatch$ is equal to 1. The durations determine the window size in the encoder and the merging scale in the merging block, which further impact the performance and computational complexity. As shown in Fig. 6, the FLOPs gradually decrease as the durations used increase. The accuracy of SpeechFormer++ remains generally robust when the $mismatch$ lies between 0.9 and 1.1, suggesting that we can apply SpeechFormer++ directly to other English datasets with similar statistics. When the $mismatch$ is larger than 1.3 or less than 0.7, the performance starts to degrade, especially on DAIC-WOZ. These results suggest that we should recalculate the duration of each speech unit when processing different languages or dialects.

V-F5 Comparison to Finetuning of Pretrained Model

When the input feature is obtained from an already pretrained model (HuBERT in this paper), the proposed SpeechFormer++ can be viewed as a downstream model for the downstream task. To investigate the importance of the downstream model, we conduct experiments to compare the performances of finetuning the pretrained model with a simple MLP (3 dense layers) and learning further deep representation with SpeechFormer++. In addition, we implement Transformer and SpeechFormer++ with only 4 layers to investigate the impact of model size on small datasets. The experimental results in Table XIII show that finetuning the pretrained model with the simple MLP obtains inferior results compared to SpeechFormer++. We attribute this to the fact that the pretrained model utilizes general self-supervised tasks, which do not consider the characteristics of speech. For SpeechFormer++ with 4 layers, we observe that it outperforms the 12-layer SpeechFormer++ on DAIC-WOZ. This is mainly because DAIC-WOZ is a small-scale dataset. Thus, only a small number of parameters are needed to fit the data. On the other three datasets, SpeechFormer++ with 12 layers delivers superior performance. Note that SpeechFormer++ also outperforms Transformer when both employ only 4 layers. Our work provides a new perspective on modeling speech signals. In this paper, we choose to use 12 layers of Transformer and SpeechFormer++ across four datasets merely for the sake of simple, straightforward and consistent comparisons. In real-world applications, the number of layers employed in each stage can be tuned for optimal performance according to the training dataset.

V-G Adopting Attention from Computer Vision

We have confirmed that the standard full attention mechanism from NLP is unsuitable for the PSP task. Furthermore, we are interested in the performance of adopting attention methods from computer vision, which are optimized according to the characteristics of the images. Specifically, the shifted window-based algorithm Swin-T [28] (code: https://github.com/microsoft/Swin-Transformer) and the cluster-based algorithm BOAT-T [29] (code: https://github.com/mahaoyuHKU/pytorch-boat) are considered in this paper. Note that we follow the same configurations as their original papers. The recognition results on four corpora are reported in Table XIV. Computational costs are also provided for the sake of fair comparison. Unsurprisingly, the use of vision algorithms causes a significant drop in performance in the PSP tasks, as well as an increased computational burden. The results indicate that we cannot simply adapt the attention methods from other domains to speech, but instead, we need to make our own improvements based on the characteristics of the speech signal. Thus, the proposed SpeechFormer++ presents a solution to fill the gap in the literature.

V-H Visualization of Attention Weights

From the experimental results discussed above, we conclude that SpeechFormer++ provides better results than the standard Transformer with less computational cost. It also outperforms the previous state-of-the-art approaches by a large margin on four commonly used corpora. To further understand the model and determine the reasons behind the improvements, we consider two utterance samples in IEMOCAP and compare the attention weights in Transformer and SpeechFormer++ by visualization. Here, the attention weights indicate the importance of each token in the model, which are obtained by adding up all the weights of the same value vector in MSA. Note that the merging block in SpeechFormer++ aggregates the acoustic tokens, resulting in a different number of input tokens in different layers and difficulties in comparison. For that reason, we visualize the attention weights in the first layer of SpeechFormer++ and Transformer, where the number of input tokens is the same for both models. The visualization results of the two utterance samples are illustrated in Fig. 7. The attention weights are limited to the range (0, 1) by the softmax function. In addition, we manually mark out the content of the samples for better comprehension and analysis. For the first sample, forced breathing appears in the segment of the left bounding box, which is conducive to recognition. However, the attention weights in the Transformer indicate that the Transformer is not interested in that area and assigns relatively low weights to the corresponding tokens. Similarly, the right bounding box of the first sample is the prolongation of a particular word, which is essential for recognition but is omitted by Transformer. Our SpeechFormer++ is able to alleviate the above issues by assigning reasonable attention weights to the tokens in sample 1, where no informative content is neglected.

For the second sample in Fig. 7, the left bounding box outlines an imperceptible sigh that is completely ignored by the Transformer and accurately captured by our SpeechFormer++. With this sigh signal, the model can accomplish the recognition effectively. The right bounding box of sample 2 shows a continuous utterance in which multiple words are spoken in quick succession. The attention weights in Transformer are relatively stable, while those in SpeechFormer++ fluctuate. This is because our method is capable of capturing more detailed information. In other words, SpeechFormer++ applies rapidly changing attention weights to model the fine-grained features in a cost-effective manner.
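One way to obtain such per-token importance scores from an attention matrix is sketched below. It follows the description above (summing all weights assigned to the same value vector, then applying softmax), but the function name is hypothetical and the handling of multiple heads, which the text does not detail, is left out.

```python
import numpy as np


def token_importance(attn):
    """Per-token importance from an attention matrix of shape (T_query, T_key).

    The weights received by each key/value token are summed over all queries
    and then squashed into (0, 1) with a softmax.
    """
    received = attn.sum(axis=0)                  # total weight per value vector
    e = np.exp(received - received.max())
    return e / e.sum()
```

A token that many queries attend to therefore receives a high importance score, whereas a token that is largely ignored receives a low one.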

VI Conclusion

In this paper, we exploit the potential of Transformer by considering the essence of the audio properties. We reveal the implicit relationships in speech and propose a structure-based framework, called SpeechFormer++, for paralinguistic speech processing. SpeechFormer++ takes the intra- and inter-unit features into account while preserving the coarse-grained information to further boost the performance. In addition, the merging blocks are applied to imitate the hierarchical structure in the speech signal. Experimental results on four speech-related corpora demonstrate that our method substantially surpasses the standard Transformer with respect to performance and efficiency. Additionally, the comparison to state-of-the-art also confirms the superiority of SpeechFormer++. In the future, we intend to make use of the lexical information and develop a unified textual-audio framework. In addition, we intend to consider the semantic information in SpeechFormer++ for solving speech recognition tasks.

Bibliography

[1] B. Moore, L. Tyler, and W. Marslen-Wilson, “Introduction. The perception of speech: from sound to meaning,” Philosophical transactions of the Royal Society of London. Series B, Biological sciences, vol. 363, no. 1493, pp. 917–921, Mar. 2008.
[2] K. Tokuda, T. Kobayashi, and S. Imai, “Speech parameter generation from HMM using dynamic features,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, 1995, pp. 660–663.
[3] M. Crouse, R. Nowak, and R. Baraniuk, “Wavelet-based statistical signal processing using hidden Markov models,” IEEE Transactions on Signal Processing, vol. 46, no. 4, pp. 886–902, 1998.
[4] B. Schuller, G. Rigoll, and M. Lang, “Hidden Markov model-based speech emotion recognition,” in 2003 International Conference on Multimedia and Expo. ICME ’03. Proceedings, vol. 1, 2003, pp. I–401.
[5] J. Cichosz and K. Slot, “Emotion recognition in speech signal using emotion-extracting binary decision trees,” Proceedings of affective computing and intelligent interaction, 2007.
[6] L. Yang, D. Jiang, L. He, E. Pei, M. C. Oveneke, and H. Sahli, “Decision tree based depression classification from audio video and language information,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, ser. AVEC ’16, 2016, pp. 89–96.
[7] A. Stuhlsatz, C. Meyer, F. Eyben, T. Zielke, G. Meier, and B. Schuller, “Deep neural networks for acoustic emotion recognition: Raising the benchmarks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 5688–5691.
[8] Z. Wu, E. S. Chng, and H. Li, “Conditional restricted Boltzmann machine for voice conversion,” in 2013 IEEE China Summit and International Conference on Signal and Information Processing, 2013, pp. 104–108.