Speech recognition assisted by large language models to command software orally -- Application to an augmented and virtual reality web app for immersive molecular graphics
Fabio Cortes Rodriguez, Luciano Abriata

TL;DR
This paper presents a speech recognition and large language model-based voice interface for controlling an AR/VR molecular graphics web app, enabling hands-free operation through natural language commands.
Contribution
It develops and evaluates a novel VUI system integrating Chrome's ASR with LLM-driven function calling for immersive molecular visualization.
Findings
Chrome's ASR was more reliable than Whisper for scientific jargon.
Function calling with GPT-4o-mini proved safer and more efficient than code generation.
The system enables natural language control of AR/VR molecular graphics.
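One way to see why function calling can be safer than code generation: the model only ever picks a function name and arguments from a fixed list, and the app validates both before anything runs. The sketch below is illustrative only. The command names (`set_representation`, `zoom`), the `state` dictionary, and the `dispatch` helper are hypothetical and do not come from the paper. The `{"name", "arguments"}` shape is assumed to resemble what function-calling APIs typically return, with arguments as a JSON string. No model is called here.

```python
from __future__ import annotations

import json
from typing import Callable

# Hypothetical app state; a stand-in for the molecular viewer's settings.
state = {"representation": "sticks", "zoom": 1.0}


def set_representation(style: str) -> str:
    """Switch the (hypothetical) molecule rendering style."""
    allowed = {"sticks", "spheres", "cartoon"}
    if style not in allowed:
        raise ValueError(f"unsupported style: {style}")
    state["representation"] = style
    return f"representation set to {style}"


def zoom(factor: float) -> str:
    """Multiply the (hypothetical) zoom level by a bounded factor."""
    if not 0.1 <= factor <= 10:
        raise ValueError("zoom factor out of range")
    state["zoom"] *= factor
    return f"zoom is now {state['zoom']:.2f}"


# Whitelist: function name -> (callable, expected argument names and types).
REGISTRY: dict[str, tuple[Callable[..., str], dict[str, type]]] = {
    "set_representation": (set_representation, {"style": str}),
    "zoom": (zoom, {"factor": float}),
}


def dispatch(tool_call: dict[str, str]) -> str:
    """Validate a model-proposed call against the whitelist, then run it."""
    name = tool_call.get("name", "")
    if name not in REGISTRY:
        return f"rejected: unknown function {name!r}"
    func, spec = REGISTRY[name]
    try:
        args = json.loads(tool_call.get("arguments", "{}"))
    except json.JSONDecodeError:
        return "rejected: arguments are not valid JSON"
    # Argument names must match the declared spec exactly.
    if not isinstance(args, dict) or set(args) != set(spec):
        return "rejected: unexpected arguments"
    try:
        typed = {key: spec[key](value) for key, value in args.items()}
        return func(**typed)
    except (TypeError, ValueError) as err:
        return f"rejected: {err}"


if __name__ == "__main__":
    print(dispatch({"name": "set_representation", "arguments": '{"style": "spheres"}'}))
    print(dispatch({"name": "delete_files", "arguments": "{}"}))
```

With generated code, by contrast, the app would have to execute arbitrary text produced by the model; here an unrecognized name or malformed argument is simply rejected.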
Abstract
This project successfully developed, evaluated and integrated a Voice User Interface (VUI) into a web application that we are developing for immersive molecular graphics. The app provides augmented and virtual reality (AR and VR) environments where users manipulate molecules with their hands, which means the hands cannot be used to control the app through a regular mouse- and keyboard-based GUI. The speech-based VUI system developed here alleviates this problem, making it easy to control the app via natural spoken (or typed) commands. To build this VUI, we evaluated two distinct Automated Speech Recognition (ASR) systems: Chrome's native Speech API and OpenAI's Whisper v3. While Whisper offered broader browser compatibility, its tendency to "hallucinate" with specialized scientific jargon proved very problematic. Consequently, we selected Chrome's ASR for its stability, speed, and…
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques
