VoiceSHIELD-Small: Real-Time Malicious Speech Detection and Transcription

Sumit Ranjan; Sugandha Sharma; Ubaid Abbas; Puneeth N Ail

arXiv:2603.07708 · cs.SD · March 10, 2026

TL;DR

VoiceSHIELD-Small is a real-time, lightweight model built on Whisper-small that detects malicious speech and transcribes audio simultaneously, achieving high accuracy with minimal delay for voice AI security.

Contribution

The paper introduces a single integrated model that transcribes speech and detects malicious speech in the same real-time pass, optimized for low latency and high accuracy.

Findings

01

Achieved 99.16% accuracy and an F1 score of 0.9865 on a balanced test set of 947 audio clips.

02

Classifies audio in 90-120 ms on mid-tier GPUs.

03

Misses 2.33% of harmful inputs at default settings.
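
The findings above are standard binary-classification metrics. As a minimal sketch of how such numbers are derived from a confusion matrix, the Python below computes accuracy, F1, and miss rate. It reads "misses" as the false-negative rate on harmful clips, which is an assumption about the paper's definition. The `Confusion` class, the function names, and the counts are all hypothetical and are not the paper's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts, treating 'harmful' as the positive class."""
    tp: int  # harmful clips flagged as harmful
    fp: int  # safe clips flagged as harmful
    tn: int  # safe clips passed as safe
    fn: int  # harmful clips passed as safe (the misses)


def accuracy(c: Confusion) -> float:
    """Fraction of all clips that were labelled correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def f1_score(c: Confusion) -> float:
    """Harmonic mean of precision and recall, in its count form."""
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)


def miss_rate(c: Confusion) -> float:
    """False-negative rate: share of harmful clips that slipped through."""
    return c.fn / (c.tp + c.fn)


# Hypothetical counts for illustration only, NOT results from the paper.
example = Confusion(tp=90, fp=5, tn=100, fn=10)
print(f"accuracy={accuracy(example):.4f}")
print(f"f1={f1_score(example):.4f}")
print(f"miss_rate={miss_rate(example):.4f}")
```

Because the miss rate equals one minus recall, a model can report very high accuracy while still missing a noticeable share of harmful inputs, which is why the listing reports it separately.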

Abstract

Voice interfaces are quickly becoming a common way for people to interact with AI systems. This also brings new security risks, such as prompt injection, social engineering, and harmful voice commands. Traditional security methods rely on converting speech to text and then filtering that text, which introduces delays and can ignore important audio cues. This paper introduces VoiceSHIELD-Small, a lightweight model that works in real time. It can transcribe speech and detect whether it is safe or harmful, all in one step. Built on OpenAI's Whisper-small encoder, VoiceSHIELD adds a mean-pooling layer and a simple classification head. It takes just 90-120 milliseconds to classify audio on mid-tier GPUs, while transcription happens at the same time. Tested on a balanced set of 947 audio clips, the model achieved 99.16 percent accuracy and an F1 score of 0.9865. At the default setting, it…
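
The classification path described in the abstract (encoder output, mean pooling, simple classification head) can be sketched in plain Python. This is an illustrative toy and not the authors' implementation. It assumes the head is a single linear layer followed by a sigmoid and that the default decision threshold is 0.5, neither of which is stated in the abstract. The frame vectors and weights are hand-picked stand-ins for real Whisper-small encoder outputs, and `mean_pool`, `harmful_probability`, and `classify` are hypothetical names.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-frame encoder vectors into one fixed-size vector."""
    if not frames:
        raise ValueError("need at least one encoder frame")
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def harmful_probability(pooled: Sequence[float],
                        weights: Sequence[float],
                        bias: float) -> float:
    """Assumed head: one linear layer plus a sigmoid giving P(harmful)."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def classify(frames: Sequence[Sequence[float]],
             weights: Sequence[float],
             bias: float,
             threshold: float = 0.5) -> str:
    """Label a clip 'harmful' when P(harmful) reaches the threshold."""
    p = harmful_probability(mean_pool(frames), weights, bias)
    return "harmful" if p >= threshold else "safe"


# Toy stand-ins: two 3-dimensional "encoder frames" and made-up head weights.
frames = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
weights = [1.0, 0.5, -1.0]
bias = 0.0

pooled = mean_pool(frames)
print(pooled, classify(frames, weights, bias))
```

Raising `threshold` makes the sketch flag fewer clips, trading false alarms for misses. That is the kind of trade-off a "default setting" implies, though the abstract does not say how the authors' setting is defined.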

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Emotion and Mood Recognition