Evaluating Automatic Speech Recognition in an Incremental Setting

Ryan Whetten; Mir Tahsin Imtiaz; Casey Kennington

arXiv:2302.12049·cs.CL·February 24, 2023

TL;DR

This paper systematically evaluates six speech recognizers for incremental recognition, comparing their accuracy, latency, and stability, and introduces new metrics to better understand their performance in real-time applications.

Contribution

It introduces Revokes per Second as a new metric and compares two methods for streaming audio, providing insights into the performance of different speech recognizers.

Findings

01. Local recognizers are faster and require fewer updates than cloud-based ones.
02. Meta's Wav2Vec is the fastest recognizer.
03. Mozilla's DeepSpeech is the most stable in predictions.

Tables

Table 1: Local ASR engines, the models used, and their training data where available.

| Name (abbreviation) | Model | Training Data |
| --- | --- | --- |
| Wav2Vec (W2V) | wav2vec2-base-960h | LibriSpeech |
| DeepSpeech (DS) | 0.9.3 | Fisher, LibriSpeech, Switchboard, Common Voice English |
| PocketSphinx (PS) | N/A | 1,600 utterances from the RM-1 corpus |
| Vosk | en-us-0.22 | N/A |

Table 2: Summary of results. Bold indicates the best performance and italics the lowest performance for the metric in the leftmost column. Local ASRs had lower latency than cloud-based ASRs. The Concatenation method, shown in the columns marked (Con.), had higher latency and resulted in a higher EO and RPS, but not as many revokes as the online ASRs. inf means zero revokes per second.

Incremental ASR Results on LibriSpeech

| Metric | Google | Azure | W2V | W2V (Con.) | DS | DS (Con.) | PS | PS (Con.) | Vosk | Vosk (Con.) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WER | 13.2 | 9.1 | 10.6 | 4.0 | 18.3 | 8.4 | 40.4 | 31.8 | 33.4 | 6.4 |
| LAT | 0.197 | 0.539 | 0.099 | 0.127 | 0.181 | 1.443 | 0.105 | 0.220 | 0.104 | 0.167 |
| EO | 0.279 | 0.065 | 0.011 | 0.093 | 0.001 | 0.013 | 0.014 | 0.147 | 0.072 | 0.019 |
| R/Sec | 4.564 | 0.679 | 0.141 | 1.919 | 0.008 | 0.012 | 0.178 | 1.688 | 0.910 | 0.143 |
| Sec/R | 0.219 | 1.473 | 7.083 | 0.521 | 123.135 | 80.489 | 5.613 | 0.593 | 1.099 | 7.004 |

Equations

$$RPS = \frac{R}{N} \cdot \frac{N}{Time(s)} = \frac{R}{Time(s)}$$

$$SPR = \frac{Time(s)}{R} = \frac{1}{RPS}$$

Full text

Abstract

The increasing reliability of automatic speech recognition has proliferated its everyday use. However, for research purposes, it is often unclear which model one should choose for a task, particularly if there is a requirement for speed as well as accuracy. In this paper, we systematically evaluate six speech recognizers using metrics including word-error-rate, latency, and the number of updates to already recognized words on English test data, as well as propose and compare two methods for streaming audio into recognizers for incremental recognition. We further propose Revokes per Second as a new metric for evaluating incremental recognition and demonstrate that it provides insights into overall model performance. We find that, generally, local recognizers are faster and require fewer updates than cloud-based recognizers. Finally, we find Meta’s Wav2Vec model to be the fastest, and find Mozilla’s DeepSpeech model to be the most stable in its predictions.

**Index Terms:** Automatic Speech Recognition, Incremental, Spoken Dialogue Systems

1 Introduction

Performance in automatic speech recognition (ASR) has improved dramatically in the last decade. Many ASR models process incrementally in that they produce word or sub-word output as the recognition unfolds, which is an important requirement for spoken dialogue systems (SDS) that are multimodal or part of a robot platform because there is a high expectation of timely interaction from human dialogue partners [1]. Good ASR is critical in SDS applications because errors and delays produced by the ASR propagate to the downstream modules and overall system function. Most ASR evaluations rely on the word-error-rate (WER) metric, even in conversational settings; they do not usually consider incremental metrics [2]. Prior work [3, 4] proposes metrics for evaluating incremental performance such as Edit Overhead, Word First Correct Response, Disfluency Gain, and Word Survival Rate. All of these metrics fall into one of three general areas of interest: overall accuracy, speed, and stability. However, they focus on discrete word-level output.

In this paper, we make three contributions: (1) we evaluate six recent incremental ASR models on English data, (2) we propose a continuous metric that computes how much the model changes its output over time, and (3) we compare two methods for combining sub-word output incrementally. Following prior work [5, 6, 7], the evaluations provide a useful guide for deciding which ASR model to use. All of the models are implemented as modules in the ReTiCo framework [8] for ease of use in incremental settings.

2 Models & Metrics

Following the evaluation strategy in [4], we adopt the Incremental Unit (IU) framework from [9]. The IU framework is practical because it is well designed and has multiple implementations from which we can build our incremental ASR evaluation. The framework is built around incremental units; an IU is a discrete piece of information that is produced by a specific module. In our case, we focus on the ASR model as a module, and output is discretized into words (i.e., strings). The IU framework has provisions for handling cases where the ASR output was found to be in error, given new information. The IU framework proposes three operations for IUs: add, revoke, and commit. A perfect ASR would only add new words to the growing recognition prefix. But as most ASRs make errors, particularly when they work incrementally, the revoke operation allows the ASR module to remove an erroneous IU and replace it (i.e., through another add operation) in the recognized output. An example is shown in Figure 1.
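To make the add and revoke operations concrete, the following is a minimal sketch of how two successive word hypotheses could be turned into edits. It is not ReTiCo's API: the `Edit` class and `diff_hypotheses` function are hypothetical names introduced for illustration, the example words are invented, and the commit operation is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edit:
    """Hypothetical record of one IU operation (not a ReTiCo class)."""
    op: str    # "add" or "revoke"
    word: str


def diff_hypotheses(previous: List[str], current: List[str]) -> List[Edit]:
    """Turn two successive word hypotheses into revoke/add edits."""
    # Count the words of the shared prefix that both hypotheses agree on.
    common = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        common += 1
    # Revoke everything after the shared prefix, newest word first.
    edits = [Edit("revoke", w) for w in reversed(previous[common:])]
    # Then add the words of the new hypothesis beyond the shared prefix.
    edits += [Edit("add", w) for w in current[common:]]
    return edits


# Invented example: the recognizer first hears "four", then corrects itself.
hypotheses = [["four"], ["forty"], ["forty", "two"]]
prefix: List[str] = []
log: List[Edit] = []
for hyp in hypotheses:
    log.extend(diff_hypotheses(prefix, hyp))
    prefix = hyp
```

In this toy run the log holds one revoke (for "four") among four edits; a recognizer that never changed its mind would produce adds only.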

ReTiCo is a Python implementation of the IU framework [8]. We use a ReTiCo implementation for each of the ASR models evaluated. We use six different, readily available ASR models: two cloud-based and four local (i.e., running on a local GPU), chosen for their respective results and accessibility. The cloud-based models are Google Cloud’s Speech-to-Text API and Microsoft Azure’s Speech SDK. Because little information is available about the online ASR models, we cannot go into depth about the architecture and training behind them, but we explain the four local ASR models below. The local models are summarized in Table 1.

Wav2Vec (W2V): We use Meta’s Wav2Vec model [10] from a checkpoint provided by HuggingFace where the model has been pre-trained and fine-tuned on 960 hours of LibriSpeech [11]. This architecture is unique in that it is pre-trained on hours of unlabeled raw audio data. While other models first convert the audio into a spectrogram, Wav2Vec operates directly on audio data. (Footnote 1: https://huggingface.co/facebook/wav2vec2-base-960h)

DeepSpeech (DS): Mozilla’s DeepSpeech engine is based on work done by [12]. This architecture uses Recurrent Neural Networks that operate on spectrograms of the audio to make predictions. We use the 0.9.3 model and scorer for predictions. This model was trained on a wider variety of data from Fisher, LibriSpeech, Switchboard, Common Voice English, and approximately 1,700 hours of transcribed WAMU (NPR) radio shows explicitly licensed to Mozilla for use as training corpora. (Footnote 2: https://deepspeech.readthedocs.io/en/r0.9/)

PocketSphinx (PS): One of the lighter ASRs we tested is CMU’s PocketSphinx [13]. PS is a lightweight ASR that is part of the open source speech recognition toolkit called the CMUSphinx Project. This model was trained on 1,600 utterances from the RM-1 speaker-independent training corpus. Unlike the previously mentioned models, PS does not use neural networks and is instead based on traditional speech recognition methods: hidden Markov models, language models, and phonetic dictionaries. (Footnote 3: https://github.com/cmusphinx/pocketsphinx-python)

Vosk: Alpha Cephei’s Vosk (with the vosk-model-en-us-0.22 model) is built on top of Kaldi [14], and like PocketSphinx, uses an acoustic model, language model, and phonetic dictionary. Vosk uses a neural network for the acoustic model part of the engine. (Footnote 4: https://alphacephei.com/vosk/)

2.1 Metrics

As mentioned, all previously proposed metrics for evaluating incremental ASR can be divided into three broad categories: overall accuracy (using WER), speed, and stability. We review the specific metrics used for the latter two and introduce our new metric, which combines speed and stability into a single measure.

2.1.1 Predictive Speed: Latency

To measure the general predictive speed of an ASR model, we measure the time from when the ASR engine receives the audio until the prediction is made. We then divide this time by the number of words in that particular prediction. With this, we define latency as the average amount of time per word it takes an ASR engine to make a prediction: $LAT=\frac{Time}{N}$, where $Time$ is measured in seconds and $N$ is the total number of words in a given prediction.
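The latency definition above could be measured along the following lines. This is a sketch under stated assumptions: `recognize` stands in for any callable that maps audio bytes to a list of words, and both function names are hypothetical rather than taken from the paper or from ReTiCo.

```python
import time
from typing import Callable, List, Tuple


def timed_prediction(
    recognize: Callable[[bytes], List[str]], audio: bytes
) -> Tuple[List[str], float]:
    """Run a recognizer once and return its words plus the elapsed seconds."""
    start = time.perf_counter()
    words = recognize(audio)
    return words, time.perf_counter() - start


def latency(elapsed_seconds: float, n_words: int) -> float:
    """LAT = Time / N, in seconds per word."""
    if n_words == 0:
        raise ValueError("latency is undefined for an empty prediction")
    return elapsed_seconds / n_words
```

For example, a prediction of 4 words that took 1.0 second has a latency of 0.25 seconds per word.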

2.1.2 Stability: Edit Overhead

For measuring stability, we measure the edit overhead (EO). EO is the total number of revokes divided by the total number of edits (additions and revokes) that the ASR engine makes. In an incremental SDS setting, this could be thought of as the fraction of text incremental units that are revokes: $EO=\frac{R}{\text{number of edits}}$.
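As a small worked illustration of the EO definition, the sketch below computes the ratio from counts of adds and revokes. The function name is hypothetical, and returning 0.0 when there are no edits is an assumption of this sketch, not something the paper specifies.

```python
def edit_overhead(n_adds: int, n_revokes: int) -> float:
    """EO = revokes / (adds + revokes)."""
    total_edits = n_adds + n_revokes
    if total_edits == 0:
        # Assumption: treat "no edits at all" as zero overhead.
        return 0.0
    return n_revokes / total_edits
```

With 3 adds and 1 revoke, a quarter of all edits are revokes, so EO is 0.25.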

2.1.3 Revokes per Second

Our proposed and final metric is the number of Revokes per Second (RPS). We propose this metric as a way to capture the relationship between speed and stability in an interpretable fashion. In an incremental SDS setting, this is the average number of ASR word output IUs per second that are labeled as type revoke.

We first calculate the average number of revokes per word, then divide it by our latency metric to get the average number of Revokes per Second. We also look at the inverse, Seconds per Revoke (SPR), as a simple adjustment to this metric to see how many seconds will pass before one can expect to see a revoke. The SPR value is useful for interpretation when the RPS is low. Taken together, the formulas for these metrics are as follows:

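$$RPS = \frac{R}{N} \cdot \frac{N}{Time(s)} = \frac{R}{Time(s)}$$

$$SPR = \frac{Time(s)}{R} = \frac{1}{RPS}$$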
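The two-step computation described above (revokes per word divided by latency) can be sketched as follows. The function names are hypothetical; returning infinity for SPR when there are no revokes follows the Table 2 caption, which uses inf for zero revokes per second.

```python
def revokes_per_second(n_revokes: int, n_words: int, elapsed_seconds: float) -> float:
    """RPS = (R / N) / LAT, which simplifies to R / Time."""
    revokes_per_word = n_revokes / n_words
    lat = elapsed_seconds / n_words  # latency in seconds per word
    return revokes_per_word / lat


def seconds_per_revoke(rps: float) -> float:
    """SPR = 1 / RPS; infinite when no revokes occur."""
    return float("inf") if rps == 0 else 1.0 / rps
```

For instance, 4 revokes over 16 words recognized in 2.0 seconds gives 0.25 revokes per word and a latency of 0.125 seconds per word, hence 2 revokes per second, or one revoke every 0.5 seconds. Applied to Table 2, Google's R/Sec of 4.564 corresponds to about 0.219 seconds per revoke, matching the Sec/R row.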

2.2 Combining Sub-word Output

Both Google and Azure offer incremental ASR results. For these two ASRs, the audio files are sent to the cloud services in chunks, and the service returns a prediction with other meta-information. The local ASR engines work at word and sub-word levels, necessitating a method of combining the sub-word output into words. (Footnote 5: We used the same PC with a GTX1080TI GPU for the local models.) We apply and compare two methods in this evaluation: Sliding Window and Concatenation.

For Sliding Window, we pass the audio from the file in chunks that are a bit longer than one second. These are then concatenated together as an audio buffer and given to an ASR model until it produces a prediction of at least 5 words or the audio buffer contains about 30 seconds of audio. At this point, we remove the first 35% of the buffer and repeat. This results in a series of predictions on segments of audio containing around 2 to 5 words. When a prediction is received, it is joined together with previous predictions. Because incoming predictions overlap, joining them is nontrivial. The lookup method joins predictions using dictionaries from WordNet and NLTK [15, 16].
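The Sliding Window bookkeeping described above might look roughly like the sketch below. All names and constants are illustrative choices based on the figures in the text (5 words, about 30 seconds, 35%). Note that `join_overlap` is a deliberately simpler stand-in: it merges on exact word overlap and does not reproduce the paper's dictionary-based lookup method with WordNet and NLTK.

```python
from typing import List, Sequence

MIN_WORDS = 5          # trim once a prediction has at least this many words
MAX_SECONDS = 30.0     # ... or once the buffer holds about this much audio
DROP_PERCENT = 35      # share of the buffer removed from the front


def should_trim(n_words: int, buffer_seconds: float) -> bool:
    """Decide whether the front of the audio buffer should be dropped."""
    return n_words >= MIN_WORDS or buffer_seconds >= MAX_SECONDS


def trim_front(buffer: Sequence[int], percent: int = DROP_PERCENT) -> List[int]:
    """Drop the first `percent` of the samples and keep the rest."""
    cut = len(buffer) * percent // 100
    return list(buffer[cut:])


def join_overlap(transcript: List[str], prediction: List[str]) -> List[str]:
    """Append a prediction, skipping words that repeat the transcript's tail.

    Stand-in for the paper's lookup method: exact word overlap only.
    """
    max_k = min(len(transcript), len(prediction))
    for k in range(max_k, 0, -1):
        if transcript[-k:] == prediction[:k]:
            return transcript + prediction[k:]
    return transcript + prediction
```

An exact-overlap merge like this fails when the overlapping words are recognized differently in two windows, which is presumably part of why the joining step is nontrivial in practice.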

For the Concatenation method, we present the audio in chunks into an audio buffer in the same manner as the Sliding Window method, except the buffer is a concatenation of all the audio (i.e., no audio ever gets removed from the buffer). Essentially, with this method, the ASR model makes a prediction from the very beginning of the file to the most recent audio given to the buffer. This is computationally more expensive and takes more memory because the ASR model has to make predictions on longer pieces of audio as time goes on, but this method eliminates the need for joining. Diagrams showing these two methods can be seen in Figures 2 and 3.
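A minimal sketch of the Concatenation loop, assuming a recognizer that takes the whole buffer as bytes and returns a list of words. `stream_concatenation` and `fake_recognize` are hypothetical names; the fake recognizer simply treats every byte as one word so the example runs without a real ASR engine.

```python
from typing import Callable, Iterable, Iterator, List


def stream_concatenation(
    chunks: Iterable[bytes], recognize: Callable[[bytes], List[str]]
) -> Iterator[List[str]]:
    """Yield one hypothesis per chunk, each computed over all audio so far."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)             # nothing is ever removed
        yield recognize(bytes(buffer))   # re-decode from the very beginning


def fake_recognize(audio: bytes) -> List[str]:
    """Hypothetical stand-in recognizer: one 'word' per byte of audio."""
    return [chr(b) for b in audio]


hypotheses = list(stream_concatenation([b"ab", b"c", b"de"], fake_recognize))
```

Each hypothesis covers the full utterance so far, so no joining step is needed, but the input to the recognizer grows with every chunk, which is the extra compute and memory cost noted above.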

3 Experiment

In this section, we explain our experiment, including the evaluation data we used and how we systematically produced and evaluated the IUs from our ASR modules.

3.1 Data & Procedure

For evaluation, we use two datasets: LibriSpeech and a recently assembled dialogue dataset of simulated medical conversations [17]. (Footnote 6: We were unable to obtain the Switchboard corpus due to prohibitive costs.) The LibriSpeech test-clean dataset contains 5.4 hours of speech from 40 different speakers, 20 male and 20 female. This audio is divided into over 2,600 files with an average of about 20 words per file, containing a vocabulary of over 8,100 words. To ensure the audio would work on all of our models, we converted the audio files to WAV files.

The medical conversation dataset contains 272 audio files with corresponding transcripts. The audio files range from around 7 to 20 minutes in length, or about 800 to 2,200 words. Due to the size of these audio files, we split the files into utterances based on silence and then randomly sampled a set of 40 utterances, 17 of which could be processed by all 6 ASR engines (max 40 seconds, min 0.8 seconds, 6.1 seconds in length on average). The remaining utterances were excluded because of their length and the input constraints of the individual models. The purpose of using this dialogue data is to 1) test each model on domain data that presumably none of them have been trained on (since this dataset was only made public in 2022), and 2) test how each engine performs on a dialogue dataset that contains disfluencies such as fillers, corrections, and restarts.

3.2 Results

The results can be seen in Table 2. When using the Sliding Window method, local models had lower latency than both cloud models. Some of the local ASR models using the Concatenation method were also faster than both of the cloud ones, but generally the Concatenation tests were slower and had a higher EO than the Sliding Window method. Despite this, the Concatenation versions performed better than their corresponding Sliding Window versions in terms of WER. For the cloud models, Google is less accurate and more revoke dependent than Azure. However, Google is considerably quicker, which could be crucial in an interactive dialogue setting. The cloud models had surprisingly low latency (though the latency depends on Internet speed), but the local ASRs generally had the lowest latency.

The local ASR engine that performed best overall in terms of WER was the W2V model using the Concatenation method on the LibriSpeech data and Vosk on the Medical Dialogue data, while the model with the lowest Edit Overhead was the DS model using the Sliding Window method. Though a low WER is generally better, the number of revokes has implications for downstream modules in an SDS; keeping the EO and Revokes per Second low together with a low WER means the model was correct early, which is ideal.

Our results are consistent with previous evaluations of incremental ASR [4], which show that Google’s ASR predictions, although fairly accurate overall, are not as stable as the others, with the highest Edit Overhead of 0.279/0.228 and an average of about 4.5/5.1 Revokes per Second on the LibriSpeech dataset and Medical Dialogue dataset respectively.

The DS model’s WER was higher than that of other models, but the low EO and infrequent revokes make it a potentially good candidate for an SDS that requires high accuracy as well as low latency and EO, for example in a robotic platform. We suggest Concatenation for live microphones because it is more accurate and does not require a dictionary.

4 Conclusion

In this work, we tested six different ASR models in an incremental SDS setting and evaluated them using final WER, latency, and Edit Overhead. We also proposed a new metric, Revokes per Second. We showed that, generally, online ASR (Google Cloud and Azure cloud services) is not as fast as most local ASR engines tested, and while these are some of the most accurate ASRs we tested, they both have a relatively high number of Revokes per Second which, in combination with the latency, could potentially lead to more issues in an incremental setting.

One of the challenges in evaluating ASR models is that, as described, the cloud ASRs do not publicly describe the precise architecture and training data used, and the local ASRs differ greatly from one another in both architecture and training data. As a result, there are too many variables and unknowns to attribute good WER in a given model to its architecture or to its training data. That being said, we do believe that in terms of out-of-the-box performance, our results are conclusive that online ASRs tend to have higher latency and Edit Overhead. Furthermore, we believe that our proposed metric, Revokes per Second, is an interpretable, useful metric that should be used as ASR becomes more prevalent in live settings such as Spoken Dialogue Systems on robots or live captioning in online meetings.

In future work, we plan on evaluating using different datasets and in different languages.

Bibliography

[1] Casey Kennington, Daniele Moro, Lucas Marchand, Jake Carns, and David McNeill, “rrSDS: Towards a robot-ready spoken dialogue system,” in Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 1st virtual meeting, July 2020, pp. 132–135, Association for Computational Linguistics.
[2] Andrew Cameron Morris, Viktoria Maier, and Phil Green, “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in Eighth International Conference on Spoken Language Processing, 2004.
[3] Timo Baumann, Michaela Atterer, and David Schlangen, “Assessing and improving the performance of speech recognition for incremental systems,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 380–388.
[4] Timo Baumann, Casey Redd Kennington, J. Hough, and David Schlangen, “Recognising conversational speech: What an incremental ASR should do for a dialogue system and how to get there,” in IWSDS, 2016.
[5] Fabrizio Morbini, Kartik Audhkhasi, Kenji Sagae, Ron Artstein, Doğan Can, Panayiotis Georgiou, Shri Narayanan, Anton Leuski, and David Traum, “Which ASR should I choose for my dialogue system?,” in Proceedings of the SIGDIAL 2013 Conference, Metz, France, Aug. 2013, pp. 394–403, Association for Computational Linguistics.
[6] Seyed Hossein Alavi, Anton Leuski, and David Traum, “Which model should we use for a real-world conversational dialogue system? A cross-language relevance model or a deep neural net?,” in Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, May 2020, pp. 735–742, European Language Resources Association.
[7] Kallirroi Georgila, Anton Leuski, Volodymyr Yanov, and David Traum, “Evaluation of off-the-shelf speech recognizers across diverse dialogue domains,” in Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, May 2020, pp. 6469–6476, European Language Resources Association.
[8] Thilo Michael, “ReTiCo: An incremental framework for spoken dialogue systems,” in Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2020, pp. 49–52.