External Benchmarking of Lung Ultrasound Models for Pneumothorax-Related Signs: A Manifest-Based Multi-Source Study
Takehiro Ishikawa

TL;DR
This study develops a multi-source benchmark for lung ultrasound AI models to evaluate their generalization and task validity, revealing limitations of binary classification in clinical pneumothorax detection.
Contribution
It introduces a manifest-based external benchmark for lung ultrasound models, enabling reproducible evaluation across sources and highlighting the complexity of pneumothorax signs.
Findings
A previously published single-site binary classifier achieved ROC-AUC 0.9625 in-domain but only 0.7050 externally.
The model treated lung pulse as normal, indicating that the binary classification scheme is incomplete.
Lung point was identified as an intermediate ambiguity state rather than a binary class.
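The in-domain versus external gap above is reported as ROC-AUC. As a reminder of what that number measures, here is a minimal sketch of the rank-based (Mann-Whitney) computation. The labels and scores are made-up toy values, not data from the study, and the function name `roc_auc` is chosen only for illustration.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = absent sliding, 0 = normal sliding.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]  # stand-ins for P(absent)
toy_auc = roc_auc(toy_labels, toy_scores)
```

In this toy case one positive clip (score 0.4) ranks below one negative clip (score 0.6), so 8 of the 9 positive-negative pairs are ordered correctly.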
Abstract
Background and Aims: Reproducible external benchmarks for pneumothorax-related lung ultrasound (LUS) AI are scarce, and binary lung-sliding classification may obscure clinically important signs. We therefore developed a manifest-based external benchmark and used it to test both cross-domain generalization and task validity. Methods: We curated 280 clips from 190 publicly accessible LUS source videos and released a reconstruction manifest containing URLs, timestamps, crop coordinates, labels, and probe shape. Labels were normal lung sliding, absent lung sliding, lung point, and lung pulse. A previously published single-site binary classifier was evaluated on this benchmark; challenge-state analysis examined lung point and lung pulse using the predicted probability of absent sliding, P(absent). Results: The single-site comparator achieved ROC-AUC 0.9625 in-domain but 0.7050 on the…
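To make the manifest idea concrete, the sketch below models one manifest row with the fields the abstract lists (URL, timestamps, crop coordinates, label, probe shape) and averages P(absent) per label, roughly in the spirit of the challenge-state analysis. The field names, label strings, example URLs, and probability values are all assumptions for illustration; the released manifest may use a different schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Assumed label vocabulary, mirroring the four labels named in the abstract.
LABELS = {"normal_sliding", "absent_sliding", "lung_point", "lung_pulse"}


@dataclass(frozen=True)
class ClipRecord:
    """One hypothetical manifest row describing how to rebuild a clip."""
    url: str            # source video location
    start_s: float      # clip start time in seconds
    end_s: float        # clip end time in seconds
    crop: tuple         # assumed layout: (x, y, width, height) in pixels
    label: str
    probe_shape: str    # e.g. "linear" or "curved" (assumed values)

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")
        if self.end_s <= self.start_s:
            raise ValueError("end_s must be greater than start_s")


def mean_p_absent_by_label(records, p_absent):
    """Average the model's P(absent) within each label group."""
    grouped = defaultdict(list)
    for rec, p in zip(records, p_absent):
        grouped[rec.label].append(p)
    return {label: mean(ps) for label, ps in grouped.items()}


# Placeholder records and probabilities, not values from the study.
records = [
    ClipRecord("https://example.org/a", 0.0, 3.0, (0, 0, 224, 224), "lung_point", "linear"),
    ClipRecord("https://example.org/b", 5.0, 8.0, (10, 20, 224, 224), "lung_point", "curved"),
    ClipRecord("https://example.org/c", 1.0, 4.0, (0, 0, 224, 224), "lung_pulse", "linear"),
]
summary = mean_p_absent_by_label(records, [0.5, 0.75, 0.25])
```

Storing reconstruction instructions rather than the clips themselves is what lets others rebuild the benchmark from the original public videos.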
