ecVoice: Audio Text Extraction and Optimization of Video Based on Idioms Similarity Replacement
Jinwei Lin

TL;DR
This paper introduces ecVoice, a method that enhances audio text extraction from videos by using idiom similarity analysis, significantly improving recognition accuracy while maintaining low computational resource requirements.
Contribution
The paper presents a novel approach combining idiom similarity analysis with Whisper to improve audio text extraction quality in resource-constrained environments.
Findings
Idiom grammar correction rate improved to 90% on average.
The method is simple, fast, and requires fewer computing resources.
Significantly enhances Whisper's recognition accuracy with low memory usage.
Abstract
Extracting text from the audio track of a video plays an important role in multimedia editing and processing. Whisper, a popular open-source toolkit, performs human voice recognition quickly. However, its recognition performance depends on the available computing resources, which makes Whisper difficult to run with limited memory. This paper presents a practical solution for extracting the human voice from a video and generating high-quality text from that voice. The generated voice can be used in video language translation and translated voice simulation. To improve the quality of voice extraction and transcription, we present ecVoice, a method that uses idiom similarity computation and analysis to improve the quality of audio text extraction. Experiments verify that ecVoice raises the idiom grammar correction rate to 90% on average. The…
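The idea of idiom similarity replacement can be illustrated with a minimal sketch. This is not the paper's implementation: the similarity measure (Python's `difflib.SequenceMatcher`), the 0.8 threshold, the function name `replace_similar_idioms`, and the example idiom list are all assumptions chosen for illustration. The input string stands in for a transcript produced by a speech recognizer such as Whisper; the sketch does not call Whisper itself.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed measure, not the paper's)."""
    return SequenceMatcher(None, a, b).ratio()


def replace_similar_idioms(text: str, idioms: list[str], threshold: float = 0.8) -> str:
    """Replace near-miss phrases in a transcript with the closest known idiom.

    For each idiom, slide a window of the same word count over the transcript
    and swap in the idiom when the window is similar but not identical.
    The threshold is a hypothetical value picked for this example.
    """
    words = text.split()
    for idiom in idioms:
        idiom_words = idiom.split()
        n = len(idiom_words)
        i = 0
        while i + n <= len(words):
            window = " ".join(words[i:i + n])
            is_near_miss = (
                window != idiom
                and similarity(window.lower(), idiom.lower()) >= threshold
            )
            if is_near_miss:
                words[i:i + n] = idiom_words  # replace the misrecognized phrase
                i += n
            else:
                i += 1
    return " ".join(words)


# Hypothetical transcript with a homophone error in an idiom.
transcript = "it was a peace of cake for her"
corrected = replace_similar_idioms(transcript, ["piece of cake"])
print(corrected)
```

Because the correction step is a dictionary lookup with string comparison, it adds very little memory or compute on top of the recognizer, which is consistent with the low-resource setting the abstract describes.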
