Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features
Kumar Saurav

TL;DR
This paper introduces a fast, lightweight system that uses temporal speech activity features from a neural VAD to accurately detect voicemails in telephony calls in real time, suitable for large-scale deployment.
Contribution
The author proposes an efficient approach that extracts 15 temporal speech activity features from a pre-trained neural VAD and classifies them with a shallow tree-based ensemble, achieving high accuracy without transcription or other complex processing.
Findings
Achieved 96.1% combined accuracy (734/764) in voicemail detection across two evaluation sets of telephony recordings.
Maintained a 0.3% false positive rate and a 1.3% false negative rate in production validation over 77,000 calls.
End-to-end inference runs in 46 ms on a dual-core CPU, supporting 380+ concurrent calls.
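The combined accuracy can be checked against the per-set counts reported in the abstract (139/140 on the expert-labeled set, 595/624 on the held-out production set). The short sketch below redoes that arithmetic; the helper name `accuracy_pct` is introduced here for illustration and is not from the paper.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)

# (correct, total) counts as reported in the abstract.
expert_set = (139, 140)
production_set = (595, 624)

# Pool the two sets by summing correct predictions and totals.
combined = (
    expert_set[0] + production_set[0],
    expert_set[1] + production_set[1],
)

print(combined)                        # pooled (correct, total)
print(accuracy_pct(*expert_set))       # expert-labeled set
print(accuracy_pct(*production_set))   # held-out production set
print(accuracy_pct(*combined))         # combined
```

The pooled figure is a count-weighted average, so it sits closer to the larger production set than to the expert-labeled set.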
Abstract
Outbound AI calling systems must distinguish voicemail greetings from live human answers in real time to avoid wasted agent interactions and dropped calls. We present a lightweight approach that extracts 15 temporal features from the speech activity pattern of a pre-trained neural voice activity detector (VAD), then classifies with a shallow tree-based ensemble. Across two evaluation sets totaling 764 telephony recordings, the system achieves a combined 96.1% accuracy (734/764), with 99.3% (139/140) on an expert-labeled test set and 95.4% (595/624) on a held-out production set. In production validation over 77,000 calls, it maintained a 0.3% false positive rate and 1.3% false negative rate. End-to-end inference completes in 46 ms on a commodity dual-core CPU with no GPU, supporting 380+ concurrent WebSocket calls. In our search over 3,780 model, feature, and threshold combinations,…
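To make the idea of temporal speech activity features concrete, here is a minimal sketch that turns a sequence of per-frame speech probabilities into a handful of timing features. Everything specific in it is an assumption for illustration: the 32 ms frame length, the 0.5 threshold, the function names, and the five features shown. The abstract does not list the paper's 15 features, and this is not the author's implementation.

```python
from itertools import groupby

def activity_runs(probs, threshold=0.5):
    """Collapse per-frame speech probabilities into (is_speech, n_frames) runs."""
    flags = (p >= threshold for p in probs)
    return [(flag, sum(1 for _ in run)) for flag, run in groupby(flags)]

def temporal_features(probs, threshold=0.5, frame_ms=32):
    """Compute a few illustrative timing features from a VAD output sequence.

    frame_ms and threshold are assumed values, not taken from the paper.
    """
    runs = activity_runs(probs, threshold)
    total_ms = len(probs) * frame_ms

    # Durations of speech runs, in milliseconds.
    speech_ms = [n * frame_ms for flag, n in runs if flag]

    # Pauses are silence runs with speech on both sides (not leading/trailing).
    pause_ms = [
        n * frame_ms
        for i, (flag, n) in enumerate(runs)
        if not flag and 0 < i < len(runs) - 1
    ]

    # Leading silence before the first speech frame (0 if speech starts at once).
    lead_ms = runs[0][1] * frame_ms if runs and not runs[0][0] else 0

    return {
        "speech_ratio": sum(speech_ms) / total_ms if total_ms else 0.0,
        "num_speech_segments": len(speech_ms),
        "longest_speech_ms": max(speech_ms, default=0),
        "longest_pause_ms": max(pause_ms, default=0),
        "time_to_first_speech_ms": lead_ms,
    }

# Hypothetical VAD output: short silence, speech, brief pause, speech, silence.
example = [0.1, 0.1, 0.9, 0.9, 0.9, 0.2, 0.8, 0.8, 0.1]
print(temporal_features(example))
```

One plausible reading is that a voicemail greeting tends to be a long, mostly uninterrupted stretch of speech, while a live answer is a short utterance followed by a pause, so features like these could separate the two. A feature vector of this kind would then be passed to a shallow tree-based classifier.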
