ConvFill: Model Collaboration for Responsive Conversational Voice Agents
Vidya Srinivas, Zachary Englhardt, Maximus Powers, Shwetak Patel, Vikram Iyer

TL;DR
ConvFill pairs a lightweight on-device model with a cloud backend model, so a conversational voice agent can respond with low latency while still drawing on the backend model's knowledge.
Contribution
The paper proposes conversational infill, a task in which an on-device model generates contextually appropriate dialogue while incorporating knowledge streamed from a backend model. This decouples response latency from model capability.
Findings
ConvFill achieves 36-42% accuracy improvements over standalone small models.
ConvFill maintains sub-200 ms response latency in the paper's evaluations.
Conversational infill can be learned effectively, as shown in evaluations across multiple backend models.
Abstract
Deploying conversational voice agents with large language models faces a critical challenge: cloud-based foundation models provide deep reasoning and domain knowledge but introduce latency that disrupts natural conversation, while on-device models respond immediately but lack sophistication. We propose conversational infill, a task where a lightweight on-device model generates contextually appropriate dialogue while seamlessly incorporating streaming knowledge from a powerful backend model. This approach decouples response latency from model capability, enabling systems that feel responsive while accessing the full power of large-scale models. We present ConvFill, a 360M parameter model trained on synthetic multi-domain conversations. Evaluation across multiple backend models shows that conversational infill can be successfully learned, with ConvFill achieving accuracy improvements of…
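The control flow the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation: ConvFill is a trained 360M parameter model, whereas the functions below (`backend_knowledge`, `on_device_opening`, `on_device_infill`, `respond`) are hypothetical template-based stand-ins, and the simulated delay and example chunks are assumptions chosen only to show how the first utterance can be produced before any backend output arrives.

```python
import asyncio
from typing import AsyncIterator, List, Optional


async def backend_knowledge(query: str, delay_s: float = 0.05) -> AsyncIterator[str]:
    """Hypothetical stand-in for a slow backend model that streams knowledge chunks."""
    for chunk in ["Paris is the capital of France.", "It sits on the Seine."]:
        await asyncio.sleep(delay_s)  # simulated network and inference latency
        yield chunk


def on_device_opening(query: str) -> str:
    """Hypothetical on-device step: an immediate reply that needs no backend knowledge."""
    return "Good question, let me think."


def on_device_infill(chunk: str) -> str:
    """Hypothetical on-device step: wrap a streamed chunk in conversational phrasing."""
    return f"Here is what I found: {chunk}"


async def respond(query: str) -> List[str]:
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()

    async def pump() -> None:
        # Forward backend chunks into the queue; None marks the end of the stream.
        async for chunk in backend_knowledge(query):
            await queue.put(chunk)
        await queue.put(None)

    task = asyncio.create_task(pump())  # backend request starts in the background
    spoken = [on_device_opening(query)]  # first utterance does not wait for the backend
    while (chunk := await queue.get()) is not None:
        spoken.append(on_device_infill(chunk))  # weave in knowledge as it arrives
    await task
    return spoken
```

Calling `asyncio.run(respond("What is the capital of France?"))` returns the opening utterance first, followed by one infilled utterance per backend chunk. In this sketch the time to the first utterance is independent of `delay_s`, which is the sense in which latency is decoupled from the backend model.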
Taxonomy
Topics: AI in Service Interactions · Topic Modeling · Speech Recognition and Synthesis
