VHASR: A Multimodal Speech Recognition System With Vision Hotwords

Jiliang Hu; Zuchao Li; Ping Wang; Haojun Ai; Lefei Zhang; Hai Zhao

arXiv:2410.00822·cs.SD·October 8, 2024

TL;DR

VHASR introduces a multimodal speech recognition system that leverages vision hotwords through a dual-stream architecture, significantly improving recognition accuracy over unimodal models and achieving state-of-the-art results in image-based ASR.

Contribution

The paper presents a novel multimodal ASR system utilizing vision hotwords with a dual-stream architecture, enhancing speech recognition performance.

Findings

01. VHASR outperforms unimodal ASR models.

02. Achieves state-of-the-art results on multiple datasets.

03. Effectively utilizes image information to improve recognition.

Abstract

Image-based multimodal automatic speech recognition (ASR) models enhance speech recognition performance by incorporating audio-related images. However, some works suggest that introducing image information into the model does not help improve ASR performance. In this paper, we propose a novel approach that effectively utilizes audio-related image information and build VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability. Our system uses a dual-stream architecture, which first transcribes the text on the two streams separately and then combines the outputs. We evaluate the proposed model on four datasets: Flickr8k, ADE20k, COCO, and OpenImages. The experimental results show that VHASR can effectively utilize key information in images to enhance the model's speech recognition ability. Its performance not only…
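The abstract says only that the two streams transcribe separately and that their outputs are then combined; it does not say how. As a rough illustration of that general idea, the sketch below interpolates per-position token distributions from a speech stream and a vision-hotword stream and picks the highest-scoring token. The function name `merge_streams`, the `weight` parameter, and the linear interpolation rule are all assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List

# One distribution per output position: token -> probability.
Distribution = Dict[str, float]


def merge_streams(
    asr_probs: List[Distribution],
    hotword_probs: List[Distribution],
    weight: float = 0.3,
) -> List[str]:
    """Hypothetical merge rule: linear interpolation of two streams.

    `weight` is the share given to the vision-hotword stream. This is an
    illustrative assumption, not the merging method described in the paper.
    """
    if len(asr_probs) != len(hotword_probs):
        raise ValueError("both streams must cover the same number of positions")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    merged: List[str] = []
    for asr, hot in zip(asr_probs, hotword_probs):
        vocab = set(asr) | set(hot)
        scores = {
            tok: (1.0 - weight) * asr.get(tok, 0.0) + weight * hot.get(tok, 0.0)
            for tok in vocab
        }
        # Sorting first makes tie-breaking deterministic (alphabetical order).
        merged.append(max(sorted(scores), key=scores.get))
    return merged


# Toy inputs: the speech stream slightly prefers "dock", while the
# vision-hotword stream (imagine a picture of a dog) strongly prefers "dog".
asr_stream = [{"a": 0.9, "an": 0.1}, {"dock": 0.55, "dog": 0.45}]
hotword_stream = [{"a": 0.5, "an": 0.5}, {"dog": 0.9, "dock": 0.1}]

audio_only = merge_streams(asr_stream, hotword_stream, weight=0.0)
with_vision = merge_streams(asr_stream, hotword_stream, weight=0.3)
```

With `weight=0.0` the result reduces to the speech stream alone, so the toy example shows the image evidence tipping an acoustically ambiguous word without touching confident ones.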

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Sparse Evolutionary Training