Towards Conversational Medical AI with Eyes, Ears and a Voice

Meet Shah; Jason Gusdorf; Anil Palepu; Chunjong Park; Jack W. O'Sullivan; Vishnu Ravi; Tim Strother; Pavel Dubov; Aliya Rysbek; Toshiyuki Fukuzawa; Yana Lunts; Jan Freyberg; Michael B. Chang; Aniruddh Raghu; David Stutz; Devora Berlowitz; Eliseo Papa; Taylan Cemgil; JD Velasquez; Jack Chen; Arthur Chen; Doug Fritz; Charlie Taylor; Katya Tregubova; Jing Rong Lim; Richard Green; Sara Mahdavi; Mahvish Nagda; Jihyeon Lee; Craig Schiff; Liviu Panait; Sukhdeep Singh; Valentin Liévin; David G.T. Barrett; Hannah Gladman; Anna Cupani; Francesca Pietra; Uchechi Okereke; Katherine Tong; Clemens Meyer; Erwan Rolland; Mili Sanwalka; Michael D. Howell; Shixiang Shane Gu; Bibo Xu; Euan A. Ashley; S. M. Ali Eslami; Gregory Wayne; Pushmeet Kohli; Vivek Natarajan; Adam Rodman; Alan Karthikesalingam; Ryutaro Tanno

arXiv:2605.09272 · cs.AI · May 12, 2026



TL;DR

This paper introduces AI co-clinician, a real-time conversational medical AI system using audio-visual data to assist clinical decisions, demonstrating promising results in simulated telemedicine scenarios.

Contribution

The work presents a novel AI system leveraging continuous audio-visual streams for real-time clinical reasoning, advancing beyond text-only approaches in medical AI.

Findings

01. AI co-clinician approaches primary care physicians in key diagnostic dimensions

02. It significantly outperforms GPT-Realtime in general criteria

03. It matches physicians in case-specific triage but lags in overall performance

Abstract

The practice of medicine relies not only upon skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Building on the low-latency voice and video processing capabilities of Gemini, we introduce AI co-clinician, a first-of-its-kind conversational AI system utilizing continuous streams of audio-visual data from live patient conversations to inform real-time clinical decisions. Its dual-agent architecture balances deep clinical reasoning with the low latency required for natural dialogue. To assess this system, we implemented a video-based interface emulating telemedicine consultations. We crafted 20 standardized outpatient scenarios requiring proactive real-time auditory and visual reasoning and designed "TelePACES" evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover…
