A low latency attention module for streaming self-supervised speech representation learning

Jianbo Ma; Siqi Pan; Deepak Chandran; Andrea Fanelli; Richard Cartwright

arXiv:2302.13451·cs.SD·March 19, 2024

TL;DR

This paper introduces a low-latency attention module for streaming self-supervised speech learning, enabling real-time inference with reduced latency and memory, while maintaining high accuracy in speech recognition tasks.

Contribution

The paper proposes a novel low-latency streaming attention module (LLSA) that guarantees fixed latency even with multiple layers, improving real-time speech processing capabilities.

Findings

01

Achieved 5.84% WER on Librispeech test set.

02

Reduced inference latency from 1.92 to 0.16 seconds.

03

Significantly outperformed masked acausal attention (WER 13.82%).

Abstract

The transformer is a fundamental building block in deep learning, and the attention mechanism is the transformer's core component. Self-supervised speech representation learning (SSRL) represents a popular use-case for the transformer architecture. Due to transformers' acausal behavior, the use of transformers for SSRL has been predominantly focused on acausal applications. However, several media processing problems, such as speech processing, require real-time solutions. In this paper, we present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements, while allowing real-time inference with low and fixed latency. The attention module proposed in this paper includes two components, streaming attention (SA) and low-latency streaming attention (LLSA). The SA represents our proposal for an efficient streaming SSRL…

Tables

Table 1. ASR performance comparison, measured as character error rate (CER, %). AA, SA and LLSA denote models trained with Acausal Attention, Streaming Attention and Low-Latency Streaming Attention, respectively. ‡ Latency is the duration of the whole file. † The model is fine-tuned for an additional 10 epochs, using the SA model at epoch 290 as initialization.

Attention	Latency	dev-clean	test-clean
AA	ALL‡	5.78	5.69
SA	1.8 s	5.73	5.67
SA+LLSA†	300 ms	6.73†	6.84†

Table 2. ASR performance comparison between wav2vec2 models with Acausal Attention (AA-wav2vec2), Streaming Attention (SA-wav2vec2) and Low Latency Streaming Attention (LLSA-wav2vec2). #params denotes the number of parameters of the downstream model only.

Model	#params	Latency	WER (%)
AA-wav2vec2 + BLSTM	42.8 M	ALL‡	6.42
SA-wav2vec2 + LSTM	34.4 M	3.6 s	7.55
LLSA-wav2vec2 + LSTM	34.4 M	300 ms	8.06

Table 3. Speech Emotion Recognition performance comparison.

Upstream model	Latency	ACC (%)
SUPERB-AA-HuBERT	ALL‡	64.95
AA-HuBERT	ALL‡	63.22
SA-HuBERT	3.6 s	64.97
LLSA-HuBERT	300 ms	65.04

Equations

z_{t} = [k_{0}, k_{1}, \dots, k_{N_{T} - 1}]^{T} \frac{q_{t}}{\sqrt{d_{k}}}

y_{t} = [v_{0}, v_{1}, \dots, v_{N_{T} - 1}] a_{t}

m_{t} = [0, 0, \dots, 1, 1, \dots, 1, \dots, 0]

z_{t} = [k_{t - B}, k_{t - B + 1}, \dots, k_{t}, k_{t + 1}, \dots, k_{t + A}]^{T} \frac{q_{t}}{\sqrt{d_{k}}}

y_{t} = [v_{t - B}, v_{t - B + 1}, \dots, v_{t}, v_{t + 1}, \dots, v_{t + A}] a_{t}

\frac{\partial y_{n}}{\partial v_{t}} = \begin{cases} a_{n, t - n} & \text{if } t - A \leq n \leq t + B \\ 0 & \text{otherwise} \end{cases}

\frac{\partial \ell}{\partial v_{t}} = \sum_{n = t - A}^{t + B} a_{n, t - n} \frac{\partial \ell}{\partial y_{n}}

\frac{\partial y_{t}}{\partial z_{t}} = [v_{t - B}, \dots, v_{t}, v_{t + 1}, \dots, v_{t + A}] J

\frac{\partial \ell}{\partial q_{t}} = \frac{1}{\sqrt{d_{k}}} \frac{\partial \ell}{\partial y_{t}} [v_{t - B}, \dots, v_{t}, \dots, v_{t + A}] J [k_{t - B}, \dots, k_{t}, \dots, k_{t + A}]^{T}

\frac{\partial z_{n}}{\partial k_{t}} = \begin{cases} M_{n} & \text{for } t - A \leq n \leq t + B \\ 0 & \text{otherwise} \end{cases}

\frac{\partial \ell}{\partial k_{t}} = \sum_{n = t - A}^{t + B} \frac{\partial \ell}{\partial y_{n}} \frac{\partial y_{n}}{\partial z_{n}} M_{n}

z_{t, c} = [k_{t - c - B, A}, \dots, k_{t - c, A}, k_{t - c + 1, A - 1}, \dots, k_{t + A - c, 0}]^{T} \frac{q_{t, c}}{\sqrt{d_{k}}}

y_{t, c} = [v_{t - c - B, A}, \dots, v_{t - c, A}, v_{t - c + 1, A - 1}, \dots, v_{t + A - c, 0}] a_{t, c}

\frac{\partial \ell}{\partial v_{t, c_{2}}} = \sum_{c_{1}} \sum_{n = t - A + c_{2}}^{t + B + c_{2}} a_{t - n, n, c_{1}} \frac{\partial \ell}{\partial y_{n, c_{1}}}

Code & Models

Repositories

jianboma/low_latency_attention_module (official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Full text

Low latency transformers for speech processing

Abstract

The transformer is a widely-used building block in modern neural networks. However, when applied to audio data, the transformer’s acausal behaviour, which we term Acausal Attention (AA), has generally limited its application to offline tasks. In this paper we introduce Streaming Attention (SA), which operates causally with fixed latency, and requires lower compute and memory resources than AA to train. Next, we introduce Low Latency Streaming Attention (LLSA), a method which combines multiple SA layers without latency build-up proportional to the layer count. A comparative analysis of AA, SA and LLSA on Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER) tasks is presented. The results show that causal SA-based networks with fixed latencies of a few seconds (e.g. 1.8 seconds) and LLSA networks with latencies as short as 300 ms can perform comparably with acausal (AA) networks. We conclude that SA and LLSA methods retain many of the benefits of conventional acausal transformers, but with latency characteristics that make them practical to run in real-time streaming applications.

Index Terms— Transformer, speech processing, self-attention, causal attention, low latency

1 Introduction

The transformer, introduced in [1], is one of the most popular building blocks in modern neural network architectures due to its outstanding modelling capacity. Transformers have been applied in many fields, such as Natural Language Processing (NLP) [1][2], Natural Language Understanding (NLU) [3], Computer Vision (CV) [4], and Speech and Audio Processing [5][6]. A transformer makes use of one or more attention units, which use all of the information available in a sequence of data to produce each output. This is a massive advantage when translating text from English to French, where the word order is very different, for example, but when operating on audio data, transformer architectures ingest input information that is arbitrarily distant in the past or in the future to produce a single output for a frame or sample. In other words, transformers are fundamentally acausal for audio processing and we say they use an Acausal Attention (AA) mechanism. While this is not a problem when training and testing on short audio sequences (for example, recordings of single speech utterances), it limits the transformer’s usage in streaming applications where causality (i.e. a fixed latency) is required, such as telecommunication and broadcast. Thus, fixed-latency transformers are of interest in many fields.

Several authors have proposed methods for creating causal transformers. The chunk-wise transformer segments the input vectors into sequential chunks and applies the attention mechanism within each chunk [7][8][9]. This process treats each chunk independently, and the segmentation causes sample discontinuity at the edges of each chunk. The memory-based method introduced by Wu et al. [10][11] uses an additional memory bank to enable longer contextual dependency, but careful segmentation is required. Another group of methods involves masking or time-restricting the attention score to limit the receptive field [1][12][13][6][14]. However, in section 2.2 we find that, while this group of methods has the expected effect, masking is a computationally inefficient technique for implementing this idea, particularly on audio data. As discussed in section 4, these methods also suffer from latency build-up as multiple restricted attention layers with look-ahead are concatenated.

In this paper, we propose two new methods to overcome the limitations described above. The first method, which we call Streaming Attention (SA), restricts the receptive field of an attention unit in a manner that has higher computational and memory efficiency than previous methods. The second method, which we call Low Latency Streaming Attention (LLSA), builds on SA and solves the latency build-up problem as layers are concatenated, facilitating low latency use in real-time systems. We derive forward- and back-propagation equations for both methods and report on experiments in which we created dedicated GPU implementations of SA and LLSA. We compare the resulting AA, SA and LLSA transformers on an Automatic Speech Recognition (ASR) task and a Speech Emotion Recognition (SER) task. Our results show comparable performance on both the ASR and SER tasks as the system is first made causal (SA with a latency of a few seconds), then low-latency (LLSA with a latency of 300 ms).

2 Acausal Attention

2.1 Acausal Attention

The structure of a transformer is introduced in [1], where the concept of Multi-Head Attention (MHA) is used. We here describe the core building block of MHA - the Scaled Dot-Product Attention (SDPA) unit - using vector quantities instead of the matrix quantities in [1] in order to emphasize the temporal relationships.

For a transformer implementing self-attention, each input vector $\mathbf{x}_{t}$ is projected into three quantities known as the query ( $\mathbf{q}_{t}$ , of length ${d}_{k}$ ), the key ( $\mathbf{k}_{t}$ , also of length ${d}_{k}$ ) and the value ( $\mathbf{v}_{t}$ , of length ${d}_{v}$ ) using the projection matrices $\mathrm{W}_{q}$ , $\mathrm{W}_{k}$ and $\mathrm{W}_{v}$ respectively. These projections can be implemented causally since they rely only on input from time $t$ . Together, $\mathbf{q}_{t}$ , $\mathbf{k}_{t}$ and $\mathbf{v}_{t}$ form the input to the SDPA unit.

In the SDPA unit, $\mathbf{z}_{t}$ , of length ${N}_{T}$ , is obtained as the scaled dot-product between the query $\mathbf{q}$ at time index $t$ and each of the keys $\mathbf{k}$ ,

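$$\mathbf{z}_{t}=[\mathbf{k}_{0},\mathbf{k}_{1},\dots,\mathbf{k}_{N_{T}-1}]^{T}\frac{\mathbf{q}_{t}}{\sqrt{d_{k}}}, \tag{1}$$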

where $t$ is the time index, $T$ is the transpose operator and $N_{T}$ is the number of frames. The attention score vector $\mathbf{a}_{t}$ is then computed as $\mathbf{a}_{t}=softmax(\mathbf{z}_{t})$ , where $softmax(\cdot)$ is the softmax operator. Finally, $\mathbf{y}_{t}$ is computed using the attention score vector $\mathbf{a}_{t}$ to form a linear combination of the values $\mathbf{v}$ :

$$\mathbf{y}_{t}=[\mathbf{v}_{0},\mathbf{v}_{1},\dots,\mathbf{v}_{N_{T}-1}]\,\mathbf{a}_{t}. \tag{2}$$

2.2 Masked Acausal Attention

From eq. 1 and eq. 2, it can be seen that all the key and value vectors from time $0$ to $N_{T}-1$ are involved when calculating each output $\mathbf{y}_{t}$ , and, therefore, the SDPA unit is acausal. In [1] and [12], the authors go on to describe masking of the attention score $\mathbf{a}_{t}$ so that it does not depend on the entire future input. The masking vector is

$$\mathbf{m}_{t}=[0,0,\dots,1,1,\dots,1,\dots,0], \tag{3}$$

where $1$ means the corresponding time index is used as normal when computing $\mathbf{a}_{t}$ , and $0$ indicates that the corresponding position in $\mathbf{z}_{t}$ is replaced with a large negative value prior to computing $\mathbf{a}_{t}$ so that $\mathbf{y}_{t}$ does not depend on it. It can be seen that, in addition to avoiding the use of future data, this technique can also be used to limit the amount of history data that is considered.
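The masking procedure can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the implementation of [1] or [12]: the function names, the constant `NEG_LARGE` and the list-of-lists data layout are assumptions made for the example. It computes all $N_{T}$ scores for one query frame, overwrites the masked ones, and then forms the output.

```python
import math

NEG_LARGE = -1e9  # assumed stand-in for the "large negative value"


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def masked_attention_frame(t, q, k, v, mask):
    """Output y_t for query frame t; mask[i] is 1 to keep key i, 0 to drop it."""
    d_k = len(q[t])
    # Eq. 1: all N_T scores are computed, whether they are wanted or not.
    z = [sum(a * b for a, b in zip(k_i, q[t])) / math.sqrt(d_k) for k_i in k]
    # Eq. 3: masked positions are overwritten before the softmax.
    z = [z_i if m_i else NEG_LARGE for z_i, m_i in zip(z, mask)]
    a = softmax(z)
    # Eq. 2: linear combination of the value vectors.
    return [sum(a_i * v_i[j] for a_i, v_i in zip(a, v)) for j in range(len(v[0]))]


def window_mask(t, n_frames, look_ahead, look_back):
    """Mask keeping frames t - look_back .. t + look_ahead (inclusive)."""
    return [1 if t - look_back <= i <= t + look_ahead else 0 for i in range(n_frames)]


if __name__ == "__main__":
    q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(masked_attention_frame(1, q, k, v, window_mask(1, 4, 1, 1)))
```

Note that the cost of the score computation is the same for every mask; only the softmax input changes.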

While the output $\mathbf{y}_{t}$ does not depend mathematically on unwanted future or past input, all the $\mathbf{z}_{t}$ relating to unwanted input positions are still computed and then subsequently replaced, meaning the system remains acausal. Furthermore, such a method is computationally inefficient when operating on audio data with a reasonable receptive field since all of the $\mathbf{z}_{t}$ values are computed in $O({d}_{k}{N}_{T}^{2})$ time, then extra work is done to replace most of them with a dummy value. Moreover, in a practical computer system, large amounts of memory (often scarce GPU memory) must be dedicated to storing all of the $\mathbf{z}_{t}$ values, many of which will be unused. The practical effect of this is to limit the effective batch sizes that can be used during training as well as increasing the computation time per batch.

Consider the example of a 60 second speech utterance with feature extraction running on a 10 millisecond time step. This vector would have ${N_{T}}=6000$ . If we were to restrict the receptive field of the network to 1.2 seconds, each masking vector $\mathbf{m}_{t}$ would consist of 120 ones and 5880 zeros. This results in the use of only 720,000 out of the 36 million attention values that are computed and stored in memory, as well as 35,280,000 replacements by dummy values.
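The arithmetic in this example can be reproduced with a short script. The function name and its millisecond-based arguments are illustrative choices, and edge effects (windows clipped at the start and end of the utterance) are ignored, as in the text.

```python
def masking_waste(duration_ms, hop_ms, receptive_field_ms):
    """Count attention scores computed, used and discarded under masking."""
    n_frames = duration_ms // hop_ms          # N_T
    window = receptive_field_ms // hop_ms     # ones per masking vector
    computed = n_frames * n_frames            # every query scores every key
    used = n_frames * window                  # scores that survive the mask
    return n_frames, window, computed, used, computed - used


if __name__ == "__main__":
    # 60 s utterance, 10 ms hop, 1.2 s receptive field
    print(masking_waste(60_000, 10, 1_200))
    # (6000, 120, 36000000, 720000, 35280000)
```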

3 Streaming Attention

We now introduce Streaming Attention (SA), a method for computing only those elements of $\mathbf{z}_{t}$ that are actually required in order to achieve a desired receptive field, resulting in substantially higher computational efficiency and much lower memory use for a given batch size, compared to the Masked Acausal Attention method described above.

As the back-propagation algorithm [15] is commonly used in the training of neural networks, errors generated by the objective function need to flow back for use in updating parameters during training [16]. The chain rule of calculus [16] is used to combine derivatives of different operators that are calculated in a specific order, which makes the process more efficient.

3.1 Forward Propagation

The core idea of SA is to limit the receptive field to $A$ frames of "future" or look-ahead data, and $B$ frames of history or look-back data, relative to the input data at time $t$ . Then, by adding appropriate delays of $A$ frames, a causal SDPA operator with a fixed receptive field and a latency of $A$ frames is obtained. We introduce $A$ and $B$ into eq. 1 to obtain

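$$\mathbf{z}_{t}=[\mathbf{k}_{t-B},\mathbf{k}_{t-B+1},\dots,\mathbf{k}_{t},\mathbf{k}_{t+1},\dots,\mathbf{k}_{t+A}]^{T}\frac{\mathbf{q}_{t}}{\sqrt{d_{k}}}, \tag{4}$$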

where $\mathbf{z}_{t}$ is now only $A+1+B$ frames in length instead of ${N_{T}}$ . $\mathbf{a}_{t}$ is calculated as $softmax(\mathbf{z}_{t})$ as before. For calculating $\mathbf{y}_{t}$ , we then use

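$$\mathbf{y}_{t}=[\mathbf{v}_{t-B},\mathbf{v}_{t-B+1},\dots,\mathbf{v}_{t},\mathbf{v}_{t+1},\dots,\mathbf{v}_{t+A}]\,\mathbf{a}_{t}. \tag{5}$$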
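A minimal pure-Python sketch of this forward pass is given below. It is not the CUDA kernel described in section 5; the function name, the clipping of the window at the sequence edges and the returned score counter are assumptions made for illustration. Only the keys and values inside the window $[t-B, t+A]$ are touched, so the number of scores computed is at most $N_{T}(A+B+1)$.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def streaming_attention(q, k, v, look_ahead, look_back):
    """Return (outputs, number_of_scores_computed) for windowed attention."""
    n_frames, d_k, d_v = len(q), len(q[0]), len(v[0])
    outputs, n_scores = [], 0
    for t in range(n_frames):
        # Window of eq. 4 and eq. 5, clipped at the sequence edges (assumption).
        lo = max(0, t - look_back)
        hi = min(n_frames - 1, t + look_ahead)
        window = range(lo, hi + 1)
        z = [sum(a * b for a, b in zip(k[i], q[t])) / math.sqrt(d_k) for i in window]
        a = softmax(z)
        n_scores += len(z)
        outputs.append([sum(a_i * v[i][j] for a_i, i in zip(a, window))
                        for j in range(d_v)])
    return outputs, n_scores


if __name__ == "__main__":
    q = [[math.sin(t + 1.0), math.cos(t)] for t in range(10)]
    k = [[math.cos(2.0 * t), math.sin(0.5 * t)] for t in range(10)]
    v = [[float(t), float(-t)] for t in range(10)]
    y, n_scores = streaming_attention(q, k, v, look_ahead=1, look_back=2)
    print(n_scores, "scores instead of", len(q) ** 2)
```

With a window that covers the whole sequence the function reduces to ordinary acausal attention, which gives a simple way to sanity-check an implementation.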

3.2 Back-propagation

As mentioned above, the derivatives of the loss with respect to each of the SDPA inputs need to be developed for updating parameters of the model. This includes $\frac{\partial\ell}{\partial\mathbf{v}_{t}}$ , $\frac{\partial\ell}{\partial\mathbf{k}_{t}}$ and $\frac{\partial\ell}{\partial\mathbf{q}_{t}}$ , where $\ell$ is the overall loss for one step that is usually a mini-batch.

3.2.1 Derivative with respect to values

By looking at eq. 5, we see,

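$$\frac{\partial\mathbf{y}_{n}}{\partial\mathbf{v}_{t}}=\begin{cases}a_{n,t-n}&\text{if }t-A\leq n\leq t+B\\0&\text{otherwise}\end{cases} \tag{6}$$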

where $\mathbf{a}_{n,t-n}$ denotes the $(t-n)^{th}$ element of vector $\mathbf{a}_{n}$ . $\frac{\partial\ell}{\partial\mathbf{y}_{n}}$ is the input to back-propagation. Applying the chain rule with eq. 6, the full derivative with respect to $\mathbf{v}_{t}$ is

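$$\frac{\partial\ell}{\partial\mathbf{v}_{t}}=\sum_{n=t-A}^{t+B}a_{n,t-n}\frac{\partial\ell}{\partial\mathbf{y}_{n}}. \tag{7}$$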
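The value gradient can be checked numerically. The sketch below accumulates $\frac{\partial\ell}{\partial\mathbf{v}_{t}}$ over the queries $n$ with $t-A\leq n\leq t+B$ (the range in which eq. 6 is non-zero) and compares it with a finite-difference estimate. The helper names, the dictionary representation of each $\mathbf{a}_{n}$ and the edge clipping are assumptions made for the example, not part of the original derivation.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def sa_weights(q, k, look_ahead, look_back):
    """Per-query dicts {tau: weight that query n gives to frame tau}."""
    n_t, d_k = len(q), len(q[0])
    out = []
    for n in range(n_t):
        taus = range(max(0, n - look_back), min(n_t - 1, n + look_ahead) + 1)
        z = [sum(a * b for a, b in zip(k[tau], q[n])) / math.sqrt(d_k) for tau in taus]
        out.append(dict(zip(taus, softmax(z))))
    return out


def sa_output(weights, v):
    d_v = len(v[0])
    return [[sum(w * v[tau][j] for tau, w in a.items()) for j in range(d_v)]
            for a in weights]


def grad_values(weights, grad_y, look_ahead, look_back):
    """dl/dv_t = sum over n in [t - A, t + B] of a_n[t] * dl/dy_n."""
    n_t, d_v = len(grad_y), len(grad_y[0])
    grad_v = [[0.0] * d_v for _ in range(n_t)]
    for t in range(n_t):
        for n in range(max(0, t - look_ahead), min(n_t - 1, t + look_back) + 1):
            w = weights[n][t]
            for j in range(d_v):
                grad_v[t][j] += w * grad_y[n][j]
    return grad_v


if __name__ == "__main__":
    n_t, A, B = 7, 1, 2
    q = [[math.sin(n + 1.0), math.cos(n)] for n in range(n_t)]
    k = [[math.cos(2.0 * n), math.sin(0.5 * n)] for n in range(n_t)]
    v = [[0.1 * n, 1.0 - 0.2 * n] for n in range(n_t)]
    g = [[math.sin(3.0 * n), 0.5] for n in range(n_t)]   # stands in for dl/dy
    w = sa_weights(q, k, A, B)
    analytic = grad_values(w, g, A, B)

    def loss(vals):
        y = sa_output(w, vals)
        return sum(gi * yi for gr, yr in zip(g, y) for gi, yi in zip(gr, yr))

    v_shift = [row[:] for row in v]
    v_shift[3][0] += 1e-4
    print(analytic[3][0], (loss(v_shift) - loss(v)) / 1e-4)
```

Because the attention scores do not depend on the values, the loss is linear in $\mathbf{v}$ and the two printed numbers should agree to within rounding error.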

3.2.2 Derivative with respect to queries

The Jacobian matrix of the $softmax(\cdot)$ operator, denoted as $\mathrm{J}$ , can be found in section 5.3.4 of [17]. $\frac{\partial\mathbf{y}_{t}}{\partial\mathbf{a}_{t}}$ can be determined by inspection of eq. 5 and combined with $\mathrm{J}$ using the chain rule to obtain

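$$\frac{\partial\mathbf{y}_{t}}{\partial\mathbf{z}_{t}}=[\mathbf{v}_{t-B},\dots,\mathbf{v}_{t},\mathbf{v}_{t+1},\dots,\mathbf{v}_{t+A}]\,\mathrm{J}. \tag{8}$$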
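For readers without [17] to hand, the sketch below uses the standard closed form of the softmax Jacobian, $\mathrm{J}_{ij}=a_{i}(\delta_{ij}-a_{j})$ . This is the well-known textbook expression and is assumed, rather than confirmed, to match the notation of [17]; the function names are illustrative.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def softmax_jacobian(a):
    """J[i][j] = d a_i / d z_j = a_i * (delta_ij - a_j)."""
    n = len(a)
    return [[a[i] * ((1.0 if i == j else 0.0) - a[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    a = softmax([0.3, -1.2, 2.0])
    for row in softmax_jacobian(a):
        print(["%.4f" % x for x in row])
```

Each row of $\mathrm{J}$ sums to zero, which reflects the fact that the softmax output always sums to one.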

$\frac{\partial\mathbf{z}_{t}}{\partial\mathbf{q}_{t}}$ can be determined by inspection of eq. 4. Applying the chain rule with eq. 8 and the input to back-propagation $\frac{\partial\ell}{\partial\mathbf{y}_{t}}$ , we have

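$$\frac{\partial\ell}{\partial\mathbf{q}_{t}}=\frac{1}{\sqrt{d_{k}}}\frac{\partial\ell}{\partial\mathbf{y}_{t}}[\mathbf{v}_{t-B},\dots,\mathbf{v}_{t},\dots,\mathbf{v}_{t+A}]\,\mathrm{J}\,[\mathbf{k}_{t-B},\dots,\mathbf{k}_{t},\dots,\mathbf{k}_{t+A}]^{T}. \tag{9}$$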

3.2.3 Derivative with respect to keys

From eq. 4, we obtain

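$$\frac{\partial\mathbf{z}_{n}}{\partial\mathbf{k}_{t}}=\begin{cases}\mathrm{M}_{n}&\text{for }t-A\leq n\leq t+B\\0&\text{otherwise,}\end{cases} \tag{10}$$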

where $\mathrm{M}_{n}$ is a $((A+B+1)\times d_{k})$ matrix whose $(B+t-n)^{th}$ row is $\frac{1}{\sqrt{d_{k}}}\mathbf{q}_{n}^{T}$ and whose other rows are zero. Since there is time mixing between $\mathbf{y}$ and $\mathbf{k}$ , we finally obtain the full derivative by the chain rule as

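$$\frac{\partial\ell}{\partial\mathbf{k}_{t}}=\sum_{n=t-A}^{t+B}\frac{\partial\ell}{\partial\mathbf{y}_{n}}\frac{\partial\mathbf{y}_{n}}{\partial\mathbf{z}_{n}}\mathrm{M}_{n}, \tag{11}$$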

where $\frac{\partial\mathbf{y}_{n}}{\partial\mathbf{z}_{n}}$ is defined in eq.8.

4 Low Latency Streaming Attention

We now introduce Low Latency Streaming Attention (LLSA), a method that prevents latency build-up when multiple layers of SA are concatenated, at the expense of higher computational complexity.

As mentioned in section 2, the entire input file is used to obtain keys and values for each query when operating using AA (fig.2a). In contrast, fig.2b shows two layers of transformer using SA. The red box shows the query currently being processed. The green boxes indicate which input frames are used as keys and values when processing that query. In this example the look-ahead ( $A$ ) and the look-back ( $B$ ) are both set to two frames for each of the two layers. In order to compute the output $\mathbf{y}_{t}$ of the first layer, input from time $t+2$ is required, so the first layer will impose a latency of two frames. For the same reason, the second layer will impose an additional latency of two frames, giving a total latency of four frames.

Fig.2c gives an example when processing the red frame in LLSA. It prevents latency accumulation by computing multiple output channels at each time step using different amounts of look-ahead (except at the beginning of the input where they are simply duplicated). In this example the $c=0$ output is computed considering zero look-ahead frames, the $c=1$ output considers one look-ahead frame and the $c=2$ output considers two look-ahead frames. When processing the highlighted red query, all look-back keys and values are drawn from the $c=2$ input, while the look-ahead keys and values are picked from $c=1$ for time index $t=1$ and from $c=0$ for time index $t=2$ . When the vector at $c=1,t=1$ serves as query, the same keys and values as the red vector are used. This is also true for the vector at $c=0,t=2$ .

The reason for the operation in fig.2c is more apparent when processing the next layer. When processing the red vector after layer 1, the vectors at $c=1,t=1$ and $c=0,t=2$ are already available, since computing their attention requires no future vectors beyond these five. As a result, the latency does not build up as the number of layers increases.
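The difference in latency growth between the two schemes can be written down directly. The helper below is an illustrative sketch (its name and signature are not from the paper): stacked SA layers each add their own look-ahead, whereas LLSA pays the look-ahead once regardless of depth. The second example assumes six layers with 300 ms of look-ahead each, as in the CTC-ASR experiment of section 5.2.

```python
def stacked_latency(num_layers, look_ahead, low_latency=False):
    """Total algorithmic latency, in the same unit as `look_ahead`."""
    if low_latency:
        return look_ahead              # LLSA: look-ahead is paid only once
    return num_layers * look_ahead     # SA: every layer adds its look-ahead


if __name__ == "__main__":
    # Fig. 2 example: two layers, two frames of look-ahead per layer.
    print(stacked_latency(2, 2), stacked_latency(2, 2, low_latency=True))      # 4 2
    # Six layers, 300 ms look-ahead each (values in milliseconds).
    print(stacked_latency(6, 300), stacked_latency(6, 300, low_latency=True))  # 1800 300
```

The millisecond figures agree with the 1.8 s and 300 ms latencies listed for SA and SA+LLSA in Table 1.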

LLSA-based SDPA units have multiple input channels $\mathbf{q}_{t,c}$ , $\mathbf{k}_{t,c}$ and $\mathbf{v}_{t,c}$ as well as multiple output channels $\mathbf{y}_{t,c}$ , where each channel of each signal considers a unique amount of look-ahead. To compute the output for each channel, we start with a unique $\mathbf{z}_{t,c}$ , the unnormalised attention score for output channel $c$ at time $t$ , as

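$$\mathbf{z}_{t,c}=[\mathbf{k}_{t-c-B,A},\dots,\mathbf{k}_{t-c,A},\mathbf{k}_{t-c+1,A-1},\dots,\mathbf{k}_{t+A-c,0}]^{T}\frac{\mathbf{q}_{t,c}}{\sqrt{d_{k}}}. \tag{12}$$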

$\mathbf{y}_{t,c}$ is then computed as

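$$\mathbf{y}_{t,c}=[\mathbf{v}_{t-c-B,A},\dots,\mathbf{v}_{t-c,A},\mathbf{v}_{t-c+1,A-1},\dots,\mathbf{v}_{t+A-c,0}]\,\mathbf{a}_{t,c}, \tag{13}$$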

where $\mathbf{a}_{t,c}=softmax(\mathbf{z}_{t,c})$ .
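The following pure-Python sketch illustrates one possible reading of the LLSA forward pass and is not the authors' CUDA implementation. Several assumptions are made: channels are indexed by their look-ahead count `l` (so `x[t][l]` was computed from `l` future frames, following the fig.2c description); the index $c$ on $\mathbf{z}_{t,c}$ in the equation above is read as the offset from the newest frame in the window, i.e. $c=A-l$ ; the projections are taken as identity so that queries, keys and values are all `x`; and windows are clipped at the sequence edges. The function names `lift` and `llsa_layer` are hypothetical.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def lift(frames, look_ahead):
    """Duplicate each input frame across the A + 1 look-ahead channels."""
    return [[list(f) for _ in range(look_ahead + 1)] for f in frames]


def llsa_layer(x, look_ahead, look_back):
    """x[t][l]: feature at time t computed from l look-ahead frames."""
    n_frames, d = len(x), len(x[0][0])
    y = []
    for t in range(n_frames):
        channels = []
        for l in range(look_ahead + 1):
            newest = t + l                       # latest frame this output may see
            first = max(0, newest - look_ahead - look_back)
            last = min(n_frames - 1, newest)
            q = x[t][l]
            # Frame tau can have seen at most (newest - tau) future frames,
            # so its key/value is read from that channel, capped at A.
            kv = [x[tau][min(look_ahead, newest - tau)]
                  for tau in range(first, last + 1)]
            z = [sum(a * b for a, b in zip(u, q)) / math.sqrt(d) for u in kv]
            a = softmax(z)
            channels.append([sum(a_i * u[j] for a_i, u in zip(a, kv))
                             for j in range(d)])
        y.append(channels)
    return y


if __name__ == "__main__":
    A, B = 2, 3
    frames = [[math.sin(0.7 * t), math.cos(1.3 * t)] for t in range(12)]
    out = llsa_layer(llsa_layer(lift(frames, A), A, B), A, B)
    print(out[4][A])   # two layers deep, yet depends only on frames 0..6
```

Under these assumptions, the channel-`A` output at time `t` depends on input frames up to `t + A` only, however many layers are stacked, which is the property argued for in the text.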

The development steps for the derivative calculations are omitted here for brevity, but similar procedures as section 3.2 are followed to obtain them. For example,

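$$\frac{\partial\ell}{\partial\mathbf{v}_{t,c_{2}}}=\sum_{c_{1}}\sum_{n=t-A+c_{2}}^{t+B+c_{2}}a_{t-n,n,c_{1}}\frac{\partial\ell}{\partial\mathbf{y}_{n,c_{1}}}, \tag{14}$$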

where $0\leq c_{1},c_{2}\leq A$ and $a_{t-n,n,c_{1}}$ denotes the $(t-n)^{th}$ element of $\mathbf{a}_{n,c_{1}}$ .

5 Experiments

In order to test the effectiveness of the proposed methods, experiments on ASR and SER speech processing tasks are conducted. Since automatic differentiation handles backward propagation inefficiently for SA and LLSA architectures, dedicated GPU kernels are implemented in CUDA and can be called from PyTorch [18] during off-line model training. We validate the computational efficiency of the proposed SA method in section 5.1.

For the ASR task, two different types of investigations are reported. The first compares the ASR performance of acausal attention (AA) against our implementation of streaming attention (SA) and low latency streaming attention (LLSA), using Connectionist Temporal Classification (CTC) loss [19]. The second one uses a self-supervised pre-trained model, wav2vec 2.0 [20], to compare the AA, SA, and LLSA attention layers, with ASR serving as the downstream task.

For the SER task, we compare AA, SA, and LLSA, using HuBERT [21] as upstream model and SER as downstream task. The SUPERB benchmark tool [22] is used to evaluate the performance.

5.1 Computational Efficiency

5.1.1 Lower Compute Bound

In section 2.2, we gave the theoretical complexity of Masked Acausal Attention (MAA) as $O(d_{k}N_{T}^{2})$ , which includes the redundant dummy value replacements. The computational complexity of our approach (SA) is $O(d_{k}N_{T}(A+B+1))$ , where $A$ is the number of look-ahead vectors and $B$ is the number of look-back vectors.

Even though we have not optimized our code and our implementation does not reach the theoretical computational gain, we still observe anecdotally that the compute grows linearly with the receptive field. We do not show experimental results on execution time because we have not yet optimized our implementation, and we believe this would be an unfair comparison against the highly optimized CUDA operators in PyTorch. For example, the matrix multiply in MAA comes from a library that has been fully optimised by domain experts.

5.1.2 Lower Memory Usage

In MAA, many matrices including $\mathbf{z}_{t}$ , $\mathbf{a}_{t}$ , the mask matrices, etc. are held for forward and backward propagation. They are all of size $N_{T}\times N_{T}$ and have long lifetimes, so their impact on the total memory requirement is high. By contrast, in SA the attention scores $\mathbf{z}_{t}$ , $\mathbf{a}_{t}$ , etc. are of size $N_{T}\times(A+B+1)$ . In most cases we care about, $N_{T}$ is on the order of 3000 (for 30 s with a 10 ms hop) whereas the SA receptive field $A+B+1$ is on the order of 300. Thus, for useful cases in which the length of the vector is much longer than the receptive field of the SA transformer, a substantial theoretical memory saving is expected.
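As a rough back-of-the-envelope illustration of this saving, the script below compares the storage for a single score matrix in the two schemes. The 4 bytes per value (32-bit floats) and the per-head, per-matrix accounting are assumptions made for the example, and the function names are illustrative.

```python
def score_matrix_bytes(n_frames, width, bytes_per_value=4):
    """Bytes needed for one n_frames x width score matrix (one head)."""
    return n_frames * width * bytes_per_value


def memory_ratio(n_frames, window):
    """How many times larger the MAA score matrix is than the SA one."""
    return n_frames / window


if __name__ == "__main__":
    n_frames, window = 3000, 300                   # values quoted in the text
    print(score_matrix_bytes(n_frames, n_frames))  # MAA: 36000000 bytes
    print(score_matrix_bytes(n_frames, window))    # SA:   3600000 bytes
    print(memory_ratio(n_frames, window))          # 10.0
```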

In addition to this theoretical computation, fig.3 provides a memory profiling case of the proposed SA that shows this advantage.

In this test experiment, we used an attention configuration that is typical of many applications. The specifications are as follows: the dimension of each attention head is $d_{k}=64$ , the number of input frames is $N_{T}=1000$ , the number of attention heads is either 8 or 16, and the receptive field size $(A+B+1)$ varies from 10 to 490 in steps of 10. The experiments were repeated 5 times and the mean values are reported.

In fig.3, the red line denotes the memory usage for SA and the dashed blue line is the memory usage for MAA (which is constant since it does not have the concept of a receptive field). The results show that our method requires less memory per training vector than the MAA approach in many useful configurations, particularly when the number of heads is 8 or 16. That advantage is more pronounced when the receptive field is shorter, as expected. These results show that SA requires less memory per training vector in practice as well as in theory, allowing a larger batch size for a given GPU memory size. As additional anecdotal validation of this experimental result, we were able to train our model using a single GPU, while 2 GPUs are required to train MAA with an equivalent batch size.

5.2 CTC-ASR task

In this experiment, we compare performance of a CTC-trained ASR system using AA, SA and LLSA. Input data is featurized by our own feature extraction module, computing the log-energy in 42 log-spaced frequency bands between 117 Hz and 7617 Hz on a 10 millisecond time step. Two consecutive convolutional layers with strides 3 and 2 respectively are used to downsample the processed feature representations. Six transformer layers are used in a similar structure as that in [1], including absolute positional encoding; multi-head attention, layer norm and feedforward layer are used in the transformer. When training the SA and LLSA variants, 1.2 seconds look-back and 0.3 seconds look-ahead are used. A hidden size of 512 is used as the input to each transformer layer, and feedforward layers have intermediate hidden size 2048. A simple greedy decoder [19] is used to compute the character error rate. Since this experiment focuses on latency, we report the results on dev-clean and test-clean of Librispeech [23] only.
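To relate these durations to the window sizes $A$ and $B$ , the sketch below converts them into transformer frames. It assumes that the two strided convolutions are the only rate changes between the 10 ms features and the transformer input, and that the look-ahead and look-back are applied at the transformer frame rate; the paper does not state either point explicitly, and the function name is illustrative.

```python
def to_frames(duration_ms, hop_ms=10, strides=(3, 2)):
    """Return (number of transformer frames, frame period in ms)."""
    frame_ms = hop_ms
    for stride in strides:
        frame_ms *= stride            # each strided conv lengthens the frame period
    return duration_ms // frame_ms, frame_ms


if __name__ == "__main__":
    look_ahead, frame_ms = to_frames(300)    # 0.3 s look-ahead
    look_back, _ = to_frames(1200)           # 1.2 s look-back
    print(frame_ms, look_ahead, look_back, look_ahead + look_back + 1)
    # 60 5 20 26
```

Under these assumptions each attention window spans 26 frames of 60 ms.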

Table 1 summarizes our findings. All of the models were trained for 300 epochs. The results show that the SA model trained for 300 epochs performs comparably to the one trained with AA. When we switch from SA training to LLSA training, the latency greatly improves, reducing from 1.8 seconds to 300 milliseconds and, despite training for only 10 additional epochs, there is no dramatic performance drop in ASR.

5.3 Low latency wav2vec2-ASR

In this section, we compare the ASR performance of the wav2vec 2.0 base model against its versions modified with our implementation of streaming attention. The ASR task in the SUPERB benchmark is used for evaluation.

The upstream models are implemented in fairseq [24]. We use the same model configuration as in [20] for the wav2vec 2.0 base model trained on LibriSpeech 960hr. We refer to the base model as the Acausal Attention wav2vec2 (AA-wav2vec2). The effective batch size is approximately 1.6 hours of audio samples and the model is trained for $400\,000$ iterations. To test the effectiveness of the proposed methods, the multi-head attention blocks in AA-wav2vec2 are replaced by the proposed streaming attention. We refer to the resulting upstream models as the Streaming Attention wav2vec2 (SA-wav2vec2) and Low Latency Streaming Attention wav2vec2 (LLSA-wav2vec2), respectively. The model parameters of SA-wav2vec2 are initialized using the best performing AA-wav2vec2 model on LibriSpeech dev-clean and trained for $11\,000$ iterations. LLSA-wav2vec2 is then initialized using the parameters of SA-wav2vec2 and trained for an additional $3\,000$ iterations.

We evaluate the upstream models using the ASR task in the SUPERB benchmark [22]. A 2-layer bi-directional LSTM (BLSTM) is used as the downstream model for AA-wav2vec2. A 4-layer uni-directional LSTM is used for SA-wav2vec2 and LLSA-wav2vec2, since these upstream models are strictly causal with fixed latencies. The upstream model parameters are frozen during ASR training. The downstream models for AA-wav2vec2 and SA-wav2vec2 are trained from randomly initialized parameters for $200\,000$ iterations, whereas that for LLSA-wav2vec2 is initialized from the downstream model of SA-wav2vec2 and trained for a further $48\,000$ iterations. The results are reported in Table 2. Results show that the latency of running wav2vec2 models at inference has been significantly reduced without sacrificing much ASR performance.

5.4 Low latency HuBERT-SER

We further analyze the performance of our proposed streaming attention kernel by appropriately modifying a HuBERT [21] model and training it on the task of emotion recognition. We compare its performance against the original HuBERT model with conventional acausal attention (AA-HuBERT). Following a similar training procedure to the ASR experiments, the multi-head attention blocks in HuBERT are replaced by SA and LLSA and trained for an extra $76\,000$ steps from the pre-trained HuBERT model (referred to as SA-HuBERT) and $9\,000$ steps from SA-HuBERT (referred to as LLSA-HuBERT).

As in the original SUPERB paper [22], the models are evaluated on the IEMOCAP dataset [25], which contains 12 hours of improvised and scripted recordings. The results are summarized in the form of classification accuracy in Table 3 and are compared with the results from the SUPERB benchmark (SUPERB-AA-HuBERT). While our results indicate a difference in performance for the unmodified HuBERT model when compared with the SUPERB baseline (possibly due to differences in training hyperparameters and random initializations), the HuBERT model modified with streaming attention does have a better test set accuracy than the unmodified model, in addition to the improvements in latency. Similar to the wav2vec2 ASR task, LLSA-HuBERT reduces the latency to 300 milliseconds with comparable results.

6 Conclusion

We introduce a new class of streaming transformers that overcomes the main limitation of the traditional transformer: acausality. Our solution builds on past work on causal self-attention masking, improving upon computational complexity, memory usage and latency. To achieve this, we propose Streaming Attention (SA), a method which increases efficiency and reduces the computational redundancy of causal self-attention masks, and Low-Latency Streaming Attention (LLSA), which prevents latency accumulation across transformer layers.

We first tested our SA and LLSA blocks in a CTC-ASR task. Results show reduced CER and latency for SA and LLSA, with the enablement of causal inference in the ASR task. We then implemented our solution in a pre-trained self-supervised architecture, with ASR as the downstream task. Even in this experiment, we observe reduced latency and comparable WER. To further validate the general applicability of our solution, we applied the SA and LLSA blocks to HuBERT, with SER as the downstream task. Our streaming transformer solution outperforms HuBERT with an acausal transformer in emotion classification accuracy, while reducing latency and enabling transformer-based streaming emotion classification.

While we have shown applicability of our technology to ASR and SER tasks, we believe its applicability can be extended to support additional downstream tasks, including real-time noise suppression and talker identification, and is not limited to the model architectures covered in this paper, but can be extended to most transformer-based models.

In conclusion, our Streaming Attention (SA) and Low Latency Streaming Attention (LLSA) techniques provide efficient ways to realise causality in transformer architectures, which is important for processing streaming audio with fixed latency. Our solution enables the efficient use of transformer-based architectures in new scenarios such as telecommunication, broadcasting, and other real-time applications.

Bibliography

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[2] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al., “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, 2020, pp. 38–45.
[3] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu, “TinyBERT: Distilling BERT for natural language understanding,” arXiv preprint arXiv:1909.10351, 2019.
[4] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah, “Transformers in vision: A survey,” ACM computing surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022.
[5] Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, et al., “Transformer-based acoustic modeling for hybrid speech recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6874–6878.
[6] Niko Moritz, Takaaki Hori, and Jonathan Le Roux, “Streaming automatic speech recognition with the transformer model,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6074–6078.
[7] Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, and Lei Xie, “Streaming chunk-aware multihead attention for online end-to-end speech recognition,” arXiv preprint arXiv:2006.01712, 2020.
[8] Chung-Cheng Chiu and Colin Raffel, “Monotonic chunkwise attention,” arXiv preprint arXiv:1712.05382, 2017.