# A CTC-Based Speech Recognition Network Fusing Local Convolution and Global Attention

**Authors:** Huijuan Hu, Chenyang Tang, Ping Tan, He Xu

PMC · DOI: 10.3390/s26061865 · Sensors (Basel, Switzerland) · 2026-03-16

## TL;DR

This paper introduces DBA-wav2vec 2.0, a CTC-based speech recognition model that fuses parallel local convolution and global self-attention branches through a gating mechanism, reducing character error rate, most notably for fast speech.

## Contribution

A dual-branch architecture with a task-aware gating mechanism is proposed to resolve the conflict between global and local modeling in CTC-based ASR.

## Key findings

- The method achieves relative CER reductions of 6.4% on AISHELL-1 and 7.4% on ST-CMDS over a baseline wav2vec 2.0 model.
- It shows a 15.3% relative improvement in fast-speech scenarios.
- The results suggest that structural adaptation at the decoding interface can improve the robustness of CTC-based systems to temporal variations.
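
The figures above are relative, not absolute, reductions. As a reminder of how such a number is computed, here is a minimal sketch; the function name `relative_reduction` and the CER values in the example are hypothetical and are not taken from the paper, which reports only the relative figures in this summary.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative reduction of new_cer with respect to baseline_cer, in percent."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


# Hypothetical values for illustration only (NOT results from the paper):
# a baseline CER of 5.00% improved to 4.68% is a relative reduction of about 6.4%.
example = relative_reduction(5.00, 4.68)
```

A relative reduction of 6.4% therefore corresponds to a much smaller absolute change in CER, which is worth keeping in mind when comparing against papers that report absolute differences.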

## Abstract

What are the main findings?

- A dual-branch architecture (DBA) is proposed to decouple temporal modeling into parallel local convolutional and global attention streams.
- A task-aware gating mechanism is designed to adaptively fuse heterogeneous features based on acoustic confidence.

What are the implications of the main findings?

- The method resolves the conflict between the global smoothing of wav2vec 2.0 and the local discriminative needs of Connectionist Temporal Classification (CTC) alignment.
- The approach significantly improves robustness in fast-speech scenarios, achieving a 15.3% relative performance gain.
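
To see why CTC alignment depends on locally distinct frames, it helps to look at standard greedy CTC decoding: take the most likely label per frame, merge consecutive repeats, then drop blanks. The sketch below is generic CTC decoding, not the paper's decoder; the blank index `0` and the toy posteriors are assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_posteriors, blank=BLANK):
    """Greedy CTC decoding over a list of per-frame label distributions."""
    # Most likely label for each frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_posteriors]
    # Merge consecutive repeats, then remove blanks.
    return [label for label, _ in groupby(best_path) if label != blank]


# Toy posteriors over {blank, label 1, label 2} for five frames.
toy = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],  # blank frame separating two instances of label 1
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
]
decoded = ctc_greedy_decode(toy)
```

In this toy input the two occurrences of label 1 survive only because a blank frame separates them. If the posterior at that frame were smoothed toward its neighbours, the repeats would merge into one label. Fast speech leaves fewer frames per character, so this kind of merging plausibly becomes more likely, which is consistent with the motivation stated above.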

Integrating wav2vec 2.0 with Connectionist Temporal Classification (CTC) for automatic speech recognition (ASR) often involves a trade-off between capturing global semantic consistency and maintaining local feature discriminability. This study proposes DBA-wav2vec 2.0, an architecture designed to manage these modeling requirements by decoupling temporal modeling into parallel local and global streams at the encoder–decoder interface. Depthwise separable convolutions are utilized to capture local acoustic structures, while a self-attention path is retained for long-range dependencies. A task-aware gating mechanism is introduced to integrate these heterogeneous features. By adjusting fusion weights based on acoustic input characteristics, the gate facilitates the refinement of posterior probability distributions, leading to more distinct alignment points. Experimental results on AISHELL-1 and ST-CMDS datasets show relative Character Error Rate (CER) reductions of 6.4% and 7.4%, respectively, compared to a baseline wav2vec 2.0 model. Further evaluations under varying speaking rates demonstrate a 15.3% relative improvement in fast-speech scenarios, suggesting that structural adaptation at the decoding interface can enhance the robustness of CTC-based systems against temporal variations.
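
The dual-branch idea described in the abstract can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the authors' implementation: the gate form (a per-channel sigmoid over the two branch outputs), the identity projections in the attention path, the omission of the pointwise 1x1 step of a full depthwise separable convolution, and all function and parameter names are hypothetical, since the full text is not included in this summary.

```python
import math
import random


def depthwise_conv(x, kernels):
    """Local branch: filter each channel over time with its own kernel.
    x is a T x C list of lists; kernels holds C odd-length kernels (zero padding)."""
    T, C = len(x), len(x[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            for k, w in enumerate(kernels[c]):
                s = t + k - half
                if 0 <= s < T:
                    out[t][c] += w * x[s][c]
    return out


def self_attention(x):
    """Global branch: single-head scaled dot-product attention.
    Identity query/key/value projections are an assumption made for brevity."""
    C = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(C) for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(C)])
    return out


def gated_fusion(local, global_, gate_w, gate_b):
    """Assumed gate: g = sigmoid(w_l * local + w_g * global + b) per frame and
    channel; the fused feature is g * local + (1 - g) * global."""
    fused = []
    for l_t, g_t in zip(local, global_):
        row = []
        for c, (l, g) in enumerate(zip(l_t, g_t)):
            z = gate_w[c][0] * l + gate_w[c][1] * g + gate_b[c]
            gate = 1.0 / (1.0 + math.exp(-z))
            row.append(gate * l + (1.0 - gate) * g)
        fused.append(row)
    return fused


# Toy input: 6 frames with 4 channels of random features.
random.seed(0)
T, C = 6, 4
frames = [[random.uniform(-1.0, 1.0) for _ in range(C)] for _ in range(T)]
kernels = [[0.25, 0.5, 0.25] for _ in range(C)]

local_feats = depthwise_conv(frames, kernels)
global_feats = self_attention(frames)
fused = gated_fusion(local_feats, global_feats, [[1.0, -1.0]] * C, [0.0] * C)
```

With a strongly positive gate bias the fused output follows the local branch, and with a strongly negative bias it follows the global branch. A learned gate would sit between these extremes and vary per frame, which is one way to read the abstract's description of adjusting fusion weights based on acoustic input characteristics.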

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13030727/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13030727/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC13030727/full.md

---
Source: https://tomesphere.com/paper/PMC13030727