# High Precision Speech Keyword Spotting Based on Binary Deep Neural Network in FPGA

**Authors:** Ang Zhang, Jialiang Shi, Hui Qian, Junjie Wang

PMC · DOI: 10.3390/e27111143 · 2025-11-07

## TL;DR

This paper introduces PSE-BNN, a binarized neural network for speech keyword spotting that reaches 97.29% accuracy on the Google Speech Commands Dataset while using nearly 65% fewer FPGA resources than state-of-the-art BNN-KWS designs.

## Contribution

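The paper proposes a Probability Smoothing Enhanced Binarized Neural Network (PSE-BNN) that balances accuracy against computational cost for FPGA deployment. It pairs a preliminary recognition extraction module, which extracts initial KWS features, with a result recognition module that uses temporal correlation to denoise the quantized model's features.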
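
To make the "binarized" part concrete, here is a minimal sketch of how a single-bit dot product is commonly computed in BNNs using XNOR and popcount. This is the generic technique, not the paper's FPGA implementation; the function names `binarize`, `pack_bits`, and `xnor_popcount_dot` are hypothetical and chosen only for illustration.

```python
import numpy as np

def binarize(x):
    """Map real values to {-1, +1} with the sign function (0 maps to +1)."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def pack_bits(b):
    """Encode a {-1, +1} vector as a Python int: bit i is set when b[i] == +1."""
    word = 0
    for i, v in enumerate(b):
        if v == 1:
            word |= 1 << i
    return word

def xnor_popcount_dot(wa, wb, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    XNOR marks the positions where the signs agree; each agreement
    contributes +1 and each disagreement -1, so dot = 2 * agree - n.
    """
    mask = (1 << n) - 1
    agree = bin(~(wa ^ wb) & mask).count("1")
    return 2 * agree - n

# Illustrative values only (not taken from the paper).
a = binarize(np.array([0.3, -1.2, 0.0, 2.5]))   # [ 1, -1, 1, 1]
b = binarize(np.array([-0.7, -0.1, 1.0, 0.4]))  # [-1, -1, 1, 1]
print(xnor_popcount_dot(pack_bits(a), pack_bits(b), 4))  # 2
```

In hardware, the XNOR and popcount steps replace multiply-accumulate units, which is why binarized layers can be cheap in LUTs and flip-flops.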

## Key findings

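- PSE-BNN achieves 97.29% accuracy on the Google Speech Commands Dataset (GSCD), 1.93% higher than state-of-the-art BNN-KWS designs.
- Deployed on the Xilinx VC707 platform, the design uses 1939 LUTs, 832 FFs, and 234 Kb of storage, nearly 65% fewer hardware resources than state-of-the-art BNN-KWS designs.
- The smoothing filter suppresses noise-induced entropy and improves the signal-to-noise ratio (SNR).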

## Abstract

Deep Neural Networks (DNNs) are the primary approach for enhancing the real-time performance and accuracy of Keyword Spotting (KWS) systems in speech processing. However, the exceptional performance of DNN-KWS faces significant challenges related to computational intensity and storage requirements, severely limiting its deployment on resource-constrained Internet of Things (IoT) edge devices. Researchers have sought to mitigate these demands by employing Binary Neural Networks (BNNs) through single-bit quantization, albeit at the cost of reduced recognition accuracy. From an information-theoretic perspective, binarization, as a form of lossy compression, increases the uncertainty (Shannon entropy) in the model’s output, contributing to the accuracy degradation. Unfortunately, even a slight accuracy degradation can trigger frequent false wake-ups in the KWS module, leading to substantial energy consumption in IoT devices. To address this issue, this paper proposes a novel Probability Smoothing Enhanced Binarized Neural Network (PSE-BNN) model that achieves a balance between computational complexity and accuracy, enabling efficient deployment on an FPGA platform. The PSE-BNN comprises two components: a preliminary recognition extraction module for extracting initial KWS features, and a result recognition module that leverages temporal correlation to denoise and enhance the quantized model’s features, thereby improving overall recognition accuracy by reducing the conditional entropy of the output distribution. Experimental results demonstrate that the PSE-BNN achieves a recognition accuracy of 97.29% on the Google Speech Commands Dataset (GSCD). Furthermore, deployed on the Xilinx VC707 hardware platform, the PSE-BNN utilizes only 1939 Look-Up Tables (LUTs), 832 Flip-Flops (FFs), and 234 Kb of storage. Compared to state-of-the-art BNN-KWS designs, the proposed method improves accuracy by 1.93% while reducing hardware resource usage by nearly 65%. 
The smoothing filter effectively suppresses noise-induced entropy, enhancing the signal-to-noise ratio (SNR) in the information transmission path. This demonstrates the significant potential of the PSE-BNN-FPGA design for resource-constrained edge IoT devices.
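
The abstract does not specify the form of the smoothing filter, so the following is only a sketch of the general idea under a stated assumption: a causal moving average over frame-level class scores, followed by an argmax decision. The helper names (`one_hot`, `smooth_posteriors`, `label_entropy`), the window length, and the toy label sequence are all hypothetical. The example measures the empirical Shannon entropy of the decision sequence before and after smoothing, to illustrate how exploiting temporal correlation can remove isolated misclassifications.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into hard (one-hot) per-frame scores."""
    return np.eye(num_classes)[labels]

def smooth_posteriors(post, window=3):
    """Causal moving average over frames; post has shape (T, C).

    Assumed filter form: the source only says a smoothing filter is used.
    """
    T = post.shape[0]
    out = np.empty_like(post, dtype=float)
    csum = np.cumsum(post, axis=0)
    for t in range(T):
        start = max(0, t - window + 1)
        total = csum[t] - (csum[start - 1] if start > 0 else 0.0)
        out[t] = total / (t - start + 1)
    return out

def label_entropy(labels, num_classes):
    """Empirical Shannon entropy (bits) of a sequence of decisions."""
    counts = np.bincount(labels, minlength=num_classes)
    p = counts / counts.sum()
    p = p[p > 0]
    return abs(float(-(p * np.log2(p)).sum()))

# Toy frame decisions: the true keyword is class 1, with two isolated errors.
raw = np.array([1, 1, 0, 1, 1, 1, 2, 1, 1, 1])
smoothed = smooth_posteriors(one_hot(raw, 3), window=3).argmax(axis=1)

print(label_entropy(raw, 3))       # about 0.92 bits
print(label_entropy(smoothed, 3))  # 0.0 bits: every frame now decides class 1
```

Note that averaging score vectors does not lower the entropy of each individual frame's distribution; the benefit shown here is in the decisions, where isolated flips are voted out by their neighbours.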

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12650900/full.md

---
Source: https://tomesphere.com/paper/PMC12650900