A multitask framework for automated interpretation of multi-frame right upper quadrant ultrasound in clinical decision support

Haiman Guo; Cheng-Yi Li; Yuli Wang; Robin Wang; Yuwei Dai; Qinghai Peng; Danming Cao; Zhusi Zhong; Thao Vu; Linmei Zhao; Chengzhang Zhu; Christopher Tan; Jacob Schick; Stephen Kwak; Farzad Sedaghat; Javad Azadi; James Facciola; Jonathan Feng; Dilek Oncel; Ulrike Hamper; Alex Zhu; Tej Mehta; Melissa Leimkuehler; Cheng Ting Lin; Zhicheng Jiao; Ihab Kamel; Jing Wu; Li Yang; Harrison Bai

arXiv:2601.12174 · eess.IV · January 21, 2026

Open Access

TL;DR

This paper introduces a multitask vision-language model that enhances the interpretation of right upper quadrant ultrasound by providing accurate diagnoses, coherent reports, and surgical decision support, thereby improving clinical workflow and decision-making.

Contribution

The study presents a novel multitask vision-language framework trained on large multi-center datasets, capable of comprehensive RUQ ultrasound interpretation including classification, report generation, and surgical decision support.

Findings

1. Achieved high diagnostic accuracy across tasks
2. Generated expert-level diagnostic reports
3. Effectively identified patients needing cholecystectomy

Abstract

Ultrasound is a cornerstone of emergency and hepatobiliary imaging, yet its interpretation remains highly operator-dependent and time-sensitive. Here, we present a multitask vision-language model (VLM) agent developed to assist with comprehensive right upper quadrant (RUQ) ultrasound interpretation across the full diagnostic workflow. The system was trained on a large, multi-center dataset comprising a primary cohort from Johns Hopkins Medical Institutions (9,189 cases, 594,099 images) and externally validated on cohorts from Stanford University (108 cases, 3,240 images) and a major Chinese medical center (257 cases, 3,178 images). Built on the Qwen2.5-VL-7B architecture, the agent integrates frame-level visual understanding with report-grounded language reasoning to perform three tasks: (i) classification of 18 hepatobiliary and gallbladder conditions, (ii) generation of clinically coherent…

Taxonomy

Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Artificial Intelligence in Healthcare and Education