Artificial Intelligence-Based Automated Assessment of the Four-Chamber View in Fetal Cardiac Ultrasound Videos
Naoki Teraya, Masaaki Komatsu, Katsuji Takeda, Kanto Shozu, Naoaki Harada, Reina Komatsu, Akira Sakai, Rina Aoyama, Mayumi Kaneko, Ken Asada, Syuzo Kaneko, Kazuki Iwamoto, Akitoshi Nakashima, Ryu Matsuoka, Akihiko Sekizawa, Ryuji Hamamoto

TL;DR
This paper introduces an AI system that automatically analyzes fetal heart ultrasound videos to detect heart abnormalities, matching the performance of expert doctors.
Contribution
A novel AI framework for automated four-chamber view extraction and biometric calculation in fetal cardiac ultrasound.
Findings
AI models achieved reliable 4CV extraction and accurate biometric computation.
Performance was comparable to expert obstetricians in both normal and abnormal cases.
The system works across different ultrasound systems and reduces missed abnormalities.
Abstract
The clinical application of artificial intelligence (AI) can provide technical support for examiners and improve obstetric workflow efficiency. In this study, we developed AI models that automatically extract the four-chamber view (4CV) from fetal cardiac ultrasound videos and compute the cardiothoracic area ratio, cardiac axis, and cardiac position for prenatal screening of congenital heart disease. Fetal cardiac ultrasound videos from 301 patients in the second trimester were analyzed. The 4CV was automatically extracted using YOLOv7, followed by image segmentation with UNet 3+ and SegFormer, after which automated parameter calculation and estimation were performed. A clinical comparison study involving 22 obstetricians was conducted to evaluate the screening performance of the AI models. The models demonstrated stable performance in both normal and abnormal cases, including…
Funding
- Cabinet Office BRIDGE
- MEXT subsidy for the Advanced Integrated Intelligence Platform
- National Cancer Center Research and Development Fund
- JSPS Grant-in-Aid for Scientific Research
- JST SPRING
Taxonomy
Topics: Fetal and Pediatric Neurological Disorders · Neonatal and fetal brain pathology · Congenital Heart Disease Studies
1. Introduction
Ultrasound imaging is widely used, particularly in obstetrics, because its noninvasive and real-time capabilities impose minimal burden on both the mother and fetus [1,2]. However, ultrasound image quality depends heavily on the examiner’s expertise because of the manual nature of image acquisition and the presence of acoustic shadows [3]. Artificial intelligence (AI) technologies for medical image analysis have advanced rapidly in recent years [4,5,6]. To mitigate the limited availability of medical image data, approaches such as ensemble learning, transfer learning, and large multimodal models have been widely adopted [7,8,9,10]. In addition, AI is expected to help address workforce shortages and standardize the accuracy of ultrasound examinations [11,12,13]. Previous studies have shown that, after a sonographer manually selects an appropriate four-chamber view (4CV), a representative transverse cardiac plane, AI systems can measure the biparietal diameter, abdominal circumference, and femur length to estimate fetal body weight [14]. AI-based methods have also been used to evaluate the fetal central nervous system by measuring the lateral ventricles and cerebellar width in fetal brain cross-sections to identify abnormalities [15,16]. In adults, semiautomated methods for cardiovascular echocardiography are revolutionizing the field of cardiology. AI has been clinically applied to estimate left ventricular ejection fraction by recognizing the left ventricle in the 4CV [17]. A prospective trial with pre- and post-sequential allocations reported that AI-assisted focused cardiac ultrasound led to a higher rate of treatment plan changes [18]. A deep learning method was proposed for the fully automated quantification of the calcific burden in high-resolution intravascular ultrasound images [19].
Fetal cardiac ultrasound screening is useful to detect congenital heart disease (CHD), a condition that affects approximately 0.8–1.2% of all births and often requires advanced postnatal medical care. Although prenatal detection has improved with advances in ultrasound technology, the global prenatal diagnostic rate of approximately 60% remains insufficient [20,21,22,23]. Because of the shortage of specialists in fetal cardiology, substantial variability in diagnostic performance exists among examiners. Training clinicians to perform detailed cardiovascular examinations requires considerable time, in part because opportunities for screening are limited and because the fetal heart is small and often poorly visualized due to acoustic shadows or fetal motion. During fetal cardiac ultrasound screening, examiners typically observe several standard transverse cardiac planes, including the 4CV, three-vessel view (3VV), and three-vessel tracheal view [24]. A retrospective quality assessment of fetal cardiac ultrasound images in 92 patients with severe CHD reported significant differences in image quality between patients with prenatally detected and undetected CHD [25]. In addition, the cardiothoracic area ratio (CTAR), cardiac axis, and cardiac position (point P) are broadly applicable screening parameters. However, the accuracy of manual measurements of these indices depends on the experience and skill levels of the examiners [26]. When a pregnant woman is examined at a clinic where the examiner can assess only estimated fetal weight, she may not have the opportunity to undergo a comprehensive fetal cardiac evaluation [27]. Therefore, developing AI-based technologies to support screening and to identify pregnant women who require further detailed examination is clinically important.
Few automated methods can both extract appropriate frames from ultrasound videos and compute biometric parameters, and the lack of interpretability of AI-based diagnoses remains a major concern [11]. To assist examiners in fetal cardiac ultrasound screening, we previously proposed a method that visualizes fetal cardiac structures over time using a barcode representation generated by AI [28]. In this study, we aimed to develop an automated AI model that extracts the 4CV from fetal cardiac ultrasound videos acquired during routine prenatal checkups and computes the CTAR, cardiac axis, and point P. To mirror real-world clinical measurements, we employed two segmentation models, UNet 3+ [29] and SegFormer [30]. UNet 3+ is an advanced convolutional neural network (CNN) architecture that has been applied to medical image segmentation. SegFormer is a transformer-based model developed specifically for semantic segmentation. Through a clinical comparison study, we evaluated the performance of these AI models against that of obstetricians in estimating biometric parameters.
2. Materials and Methods
2.1. 4CV Extraction from Fetal Cardiac Ultrasound Videos
Two processing pipelines were developed: one for extracting 4CV images from fetal cardiac ultrasound videos and another for image analysis. The automatic extraction model developed in our laboratory was originally based on YOLOv2 for cardiac region detection and was subsequently retrained using YOLOv7 [31]. In this study, YOLOv7 was selected for 4CV extraction because it offers high speed, high accuracy, and continuous updates, which are essential for clinical application. The method for recognizing 18 anatomical sites in the thoracic cavity remained unchanged from the previous model: the crux of the heart, ventricular septum, right atrium, tricuspid valve, right ventricle, left atrium, mitral valve, left ventricle, pulmonary artery, ascending aorta, superior vena cava, descending aorta, gastric vesicle, spine, umbilical vein, inferior vena cava, pulmonary vein, and ductus arteriosus. The extraction model was configured to convert each video into still image sets with a resolution of 480 × 640 pixels at 30 frames per second (fps). The foundation model for target detection is integrated into AI-equipped software that was approved as a medical device by the Pharmaceuticals and Medical Devices Agency of Japan in July 2024 (approval number: 30600BZX00155000). Based on our previous experimental results, the detection thresholds for YOLOv7 were adjusted to enable accurate 4CV extraction suitable for fetal cardiac ultrasound screening. A frame was recognized as a 4CV only when the crux of the heart, right atrium, right ventricle, left atrium, left ventricle, and descending aorta were all detected simultaneously, with confidence thresholds set to 0.50, 0.30, 0.30, 0.60, 0.30, and 0.40, respectively. All images meeting these thresholds were extracted from each video, and two images per video were randomly selected for the automated analyses described below.
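The frame-selection rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the per-frame input format (a dictionary mapping a structure name to its best detection confidence), the structure keys, the function names, and the use of "greater than or equal to" for meeting a threshold are all assumptions made for the example; only the six threshold values and the random choice of two frames come from the text.

```python
import random

# Confidence thresholds from Section 2.1; the key names are hypothetical labels.
THRESHOLDS = {
    "crux": 0.50,
    "right_atrium": 0.30,
    "right_ventricle": 0.30,
    "left_atrium": 0.60,
    "left_ventricle": 0.30,
    "descending_aorta": 0.40,
}


def is_4cv_frame(detections, thresholds=THRESHOLDS):
    """Return True if every required structure meets its threshold.

    `detections` maps a structure name to its highest confidence in the frame
    (assumed format); a structure that was not detected is simply absent.
    """
    return all(detections.get(name, 0.0) >= thr for name, thr in thresholds.items())


def select_4cv_frames(frames, n=2, seed=None):
    """Randomly pick up to `n` frame indices among frames that qualify as 4CV."""
    candidates = [i for i, det in enumerate(frames) if is_4cv_frame(det)]
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, min(n, len(candidates))))
```

An empty result corresponds to a video from which no 4CV image could be extracted.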
2.2. Dataset
The dataset comprised 301 pregnant women with singleton pregnancies at 18–28 weeks of gestation who underwent fetal ultrasound examinations between 2016 and 2023 at Showa Medical University Hospitals (Supplementary Table S1). Examinations were performed by experts or by obstetricians with at least 3 years of experience under appropriate supervision. Ultrasound examinations were conducted using Voluson® E8, E10, or Expert 22 (GE Healthcare, Zipf, Austria) equipped with 2–6 MHz abdominal transducers, in accordance with established guidelines [32]. Among these, 253 videos from 231 patients with normal fetal cardiac findings acquired using Voluson® E8 or E10 were used to train the models. These sweep videos continuously swept from the fetal gastric vesicle to the aortic arch over 10–15 s. Each video was divided into frames at 30 fps, and an obstetrician selected one to three images per video that clearly demonstrated an appropriate 4CV, yielding a total of 488 images for training the segmentation models. This dataset was distinct from the dataset used for automated and random 4CV extraction with the preconstructed YOLOv7 model described in Section 2.1.
2.3. Model Structure
Figure 1 provides an overview of the proposed model. Segmentation labels for the heart, ventricular septum, whole thorax, thorax, and descending aorta in the 4CV were assigned by an obstetrician. For the spine, the convex-envelope method was used to fill the hollow region within the thoracic segmentation where the spine was located, and the spine position was represented by the centroid of this region. Subsequently, the following were derived programmatically from the segmentation labels: a line passing through the centroid of the spine and bisecting the area of the whole thorax, a line passing through the centroid of the ventricular septum in the long-axis direction, and point P. Point P is defined as the intersection of the line passing through the ventricular septum and the circumference of the heart. Data augmentation was performed because of the limited amount of training data. Next, we segmented the above target areas using UNet 3+ and SegFormer and evaluated the performance of the models using the mean Dice coefficient (mDice). We compared the predictions and labels pixel by pixel to determine true positives (TP), false positives (FP), and false negatives (FN); the Dice coefficient was calculated as follows:
$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}$$
Based on the predictions of the CTAR, cardiac axis, and point P, the performance of the AI models was evaluated.
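As an illustration of the pixel-by-pixel comparison used for the Dice coefficient, the following NumPy sketch counts TP, FP, and FN for one binary mask pair and averages the result over several pairs. The function names are hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the paper.

```python
import numpy as np


def dice_coefficient(pred, label):
    """Dice = 2*TP / (2*TP + FP + FN) for two binary masks of equal shape."""
    pred = np.asarray(pred).astype(bool)
    label = np.asarray(label).astype(bool)
    tp = np.sum(pred & label)    # predicted foreground, labelled foreground
    fp = np.sum(pred & ~label)   # predicted foreground, labelled background
    fn = np.sum(~pred & label)   # predicted background, labelled foreground
    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return float(2 * tp / denom) if denom > 0 else 1.0


def mean_dice(pairs):
    """Average Dice over an iterable of (prediction, label) mask pairs."""
    values = [dice_coefficient(p, l) for p, l in pairs]
    return float(np.mean(values))
```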
Model overview. The figure shows the workflow from the input of fetal cardiac ultrasound videos to parameter calculation. YOLOv7 detects the four chambers of the heart, the crux, and the descending aorta in the 4CV and randomly extracts a set number of images from the image sets that exceed the confidence thresholds. Then, the five structures are segmented using the artificial intelligence model. The spine is approximated by the centroid of the hollow region of the thoracic segmentation result, which is filled in using a convex envelope. A purple line is defined as passing through the centroid of the spine and bisecting the area of the whole thorax, and a yellow line as passing through the centroid of the ventricular septum in the long-axis direction. Finally, programmed calculation and parameter estimation are performed. 4CV, four-chamber view; Des-Aorta, descending aorta; VS, ventricular septum; CTAR, cardiothoracic area ratio.
2.4. Segmentation
All experiments were performed on NVIDIA A100 GPUs using the PyTorch (v2.10.1) framework. Although previous studies have reported results using other CNNs, improvements in speed, accuracy, and response time remain critical for clinical applications. The primary objective of this study was to compare CNN- and Transformer-based segmentation models; therefore, UNet 3+ and SegFormer were selected because they provide high accuracy and inference speed while maintaining low computational cost. Both UNet 3+ and SegFormer are suitable for fetal cardiac ultrasound screening, as reported previously [33]. The feature extraction capacity and computational characteristics of UNet 3+ and SegFormer were as follows: number of parameters, 26.97 and 47.22 MB; floating-point operations (FLOPs), 198.45 and 18.26 GB; inference speed, 96.37 and 91.45 fps; and input image size, 256 × 256 pixels for both. Both models were trained using segmentation labels for the heart, ventricular septum, whole thorax, thorax, and descending aorta in the 4CV. The following training settings were applied to both architectures: a batch size of 16 and 200 training epochs. The RAdam optimizer was used with a learning rate of 0.0003 and a decay rate of 0.9, and parameters were updated using a momentum-based method. Early stopping was applied if no improvement was observed for 20 consecutive epochs. Data augmentation included horizontal and vertical flipping; random translation, scaling, and rotation; Gaussian blurring; random occlusion of regions up to 75 × 75 pixels; random brightness and contrast adjustments; and the addition of a single random shadow. A hybrid loss function was used, defined as:
$$\mathrm{Loss} = \mathrm{Focal} + \mathrm{MS\text{-}SSIM} + \mathrm{IoU}$$
where Focal represents the focal loss [34], MS-SSIM represents the multiscale structural similarity index loss [35], and IoU represents the intersection over union loss. This loss function is used in the base model of UNet 3+, and we adopted it without modification. The same loss function was applied to SegFormer to standardize the training conditions across both models. Receiver operating characteristic (ROC) analyses were performed using Python (v3.10.12).
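To make the structure of the hybrid loss concrete, the sketch below combines a focal term and a soft IoU term for a single binary class in NumPy and leaves the MS-SSIM term as an optional, user-supplied callable. Everything beyond the three named terms is an assumption for illustration: the focusing parameter `gamma=2.0`, the unweighted sum, the binary single-channel setting, and the function names. It is not the training code used in the study, which ran in PyTorch.

```python
import numpy as np


def focal_loss(prob, target, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over pixels (gamma is an assumed value)."""
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target)
    # p_t is the predicted probability of the true class of each pixel.
    p_t = np.where(target == 1, prob, 1.0 - prob)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))


def iou_loss(prob, target, eps=1e-7):
    """Soft IoU loss: 1 - intersection / union on predicted probabilities."""
    prob = np.asarray(prob, dtype=float)
    target = np.asarray(target, dtype=float)
    inter = np.sum(prob * target)
    union = np.sum(prob) + np.sum(target) - inter
    return float(1.0 - (inter + eps) / (union + eps))


def hybrid_loss(prob, target, ms_ssim_loss=None):
    """Sum of the focal and IoU terms, plus an MS-SSIM term if one is supplied.

    `ms_ssim_loss` is a hypothetical callable (prob, target) -> float; it is
    omitted here because a faithful MS-SSIM needs a multiscale image pyramid.
    """
    total = focal_loss(prob, target) + iou_loss(prob, target)
    if ms_ssim_loss is not None:
        total += ms_ssim_loss(prob, target)
    return total
```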
2.5. Parameter Calculations
2.5.1. CTAR
In clinical practice, the heart and whole thorax are commonly approximated as ellipses to calculate the CTAR. In this study, however, contour-based labels were used to enable automatic measurement by the AI model. The numbers of segmented pixels within the thoracic area and heart were counted, and CTAR was calculated as follows:
$$\mathrm{CTAR}\ (\%) = \frac{\text{number of pixels in the heart}}{\text{number of pixels in the thoracic area}} \times 100$$
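The pixel-counting calculation of the CTAR can be written directly on the two segmentation masks. This is a minimal sketch; the function name and the decision to raise an error on an empty thoracic mask are assumptions.

```python
import numpy as np


def ctar_percent(heart_mask, thorax_mask):
    """Cardiothoracic area ratio (%) from binary heart and thoracic-area masks."""
    heart_pixels = int(np.count_nonzero(heart_mask))
    thorax_pixels = int(np.count_nonzero(thorax_mask))
    if thorax_pixels == 0:
        # Assumed behavior: an empty thoracic mask means segmentation failed.
        raise ValueError("thoracic mask is empty")
    return 100.0 * heart_pixels / thorax_pixels
```

For example, a heart mask of 25 pixels inside a thoracic mask of 100 pixels gives a CTAR of 25%.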
2.5.2. Cardiac Axis
To calculate the cardiac axis, the centroid of the approximated spine was defined as point S (Figure 2). Point R was defined as the point farthest from point S among the intersections between the thoracic outer contour and the line passing through point S that bisects the thoracic area. A line was fitted to the segmented ventricular septum using the least-squares method, and the angle at the intersection point I between this line and line RS was calculated. Point V was defined as the intersection between the line passing through the centroid of the ventricular septum and the outer contour of the whole thorax on the side closer to the ventricular septum. The model was designed to measure the angle between lines RS and IV regardless of whether the cardiac apex was oriented to the left or right. The angle ∠RIV was defined as the cardiac axis and was calculated as follows:
$$\angle RIV = \arccos\left(\frac{\vec{IR} \cdot \vec{IV}}{|\vec{IR}|\,|\vec{IV}|}\right)$$
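The two geometric steps described above, fitting a line to the septum and measuring the angle at I, can be sketched as follows. This is an illustration under stated assumptions: the septum direction is obtained by an orthogonal (total) least-squares fit via SVD, which is swapped in here because it handles near-vertical septa and may differ from the least-squares variant used in the study; points are (x, y) pixel coordinates; and the function names are hypothetical.

```python
import numpy as np


def fit_septum_line(septum_mask):
    """Return (centroid, unit direction) of a line fitted to the septum pixels.

    Uses an orthogonal least-squares fit (first right-singular vector of the
    centered pixel coordinates); this is an assumed variant of the fit.
    """
    ys, xs = np.nonzero(septum_mask)
    points = np.column_stack([xs, ys]).astype(float)
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt[0]


def cardiac_axis_deg(r, i, v):
    """Angle RIV in degrees, i.e. the angle at I between rays IR and IV."""
    ir = np.asarray(r, dtype=float) - np.asarray(i, dtype=float)
    iv = np.asarray(v, dtype=float) - np.asarray(i, dtype=float)
    cos_angle = np.dot(ir, iv) / (np.linalg.norm(ir) * np.linalg.norm(iv))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
```

Because the angle is computed from the two rays leaving I, the same value is obtained whether the apex points to the left or to the right of line RS.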
The fetal cardiac apex direction varies depending on fetal presentation, which is defined as breech when the fetal head is oriented toward the maternal head or cephalic when it is oriented toward the caudal side of the uterus. By examining the x-coordinates of points R and V, the model determines the apex orientation. In a normal fetus in the cephalic position, the apex is located in the left thoracic cavity in the ultrasound view. In the present model, if point V lies in the negative direction relative to line RS, the cardiac apex is classified as being in the left thoracic cavity and is defined as the cephalic position (Figure 3). In cases of CHD in which the apex is located in the right thoracic cavity, the model classifies the orientation as breech. The model, therefore, identifies abnormal cardiac orientation when the estimated orientation is opposite to the examiner’s assessment because it determines fetal presentation simultaneously with cardiac axis measurement.
An abnormality assessment method was further introduced based on the relative positions of the cardiac apex and the descending aorta. In normal anatomy, the apex and descending aorta are located on the same side of the thoracic cavity in both cephalic and breech positions [32]. Using this relationship, conditions in which these structures lie on opposite sides of the thoracic cavity, such as right-sided aortic arch (RAA), can be detected. When this condition is present, visceral malposition or deviation of the cardiac axis is suspected; however, this approach cannot distinguish between right- and left-sided isomerism. As shown in Figure 3, point A was defined as the intersection between the line passing through point S and the centroid of the descending aorta and the thoracic circumference on the side opposite to point S relative to the descending aorta. In the normal case, point V lies on the arc AR, which can be expressed by the following angular relationship:
$$\angle RSA = \angle RSV + \angle VSA$$
If the heart and descending aorta are located in opposite thoracic cavities, as shown in Figure 4, the relationship becomes:
$$\angle VSA = \angle RSV + \angle RSA$$
Depending on image quality, the descending aorta may appear very close to line RS even in normal cases. However, because the primary purpose of this study was screening, no numerical threshold was applied. Abnormalities were therefore determined using the relationships defined above.
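The position check can be illustrated by measuring the three angles at point S and testing which additive relationship holds: V lying on arc AR in the normal case, or R lying between V and A when the heart and descending aorta are on opposite sides (the relation ∠VSA = ∠RSV + ∠RSA given for the CHD example). This is a sketch only; the function names, the "solitus"/"inversus"/"indeterminate" return labels, and the small numerical tolerance used for floating-point comparison are assumptions, and the tolerance is not a clinical threshold.

```python
import numpy as np


def angle_at(s, p, q):
    """Angle PSQ in degrees, measured at vertex S between rays SP and SQ."""
    sp = np.asarray(p, dtype=float) - np.asarray(s, dtype=float)
    sq = np.asarray(q, dtype=float) - np.asarray(s, dtype=float)
    cos_angle = np.dot(sp, sq) / (np.linalg.norm(sp) * np.linalg.norm(sq))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def apex_aorta_relationship(s, r, v, a, tol_deg=1e-3):
    """Classify the apex (V) and descending aorta (A) relative to line RS."""
    rsv = angle_at(s, r, v)
    rsa = angle_at(s, r, a)
    vsa = angle_at(s, v, a)
    if abs(vsa - (rsv + rsa)) <= tol_deg:
        return "inversus"       # R lies between V and A: opposite sides
    if abs(rsa - (rsv + vsa)) <= tol_deg:
        return "solitus"        # V lies on arc AR: same side
    return "indeterminate"      # neither relationship holds
```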
Location of the heart and descending aorta in a patient with congenital heart disease (CHD) (right aortic arch). Original (a) and segmented and processed images (b). The pattern diagram indicates that the heart and descending aorta are located in opposite thoracic cavities (c). For the straight lines drawn from point S to each of the other three points, the angular relationship in the patient with CHD is ∠VSA = ∠RSV + ∠RSA. A, intersection of the extended pink line connecting the spine and descending aorta with the thoracic circumference, used as a substitute for the descending aorta; R, intersection of the thoracic circumference with an orange line passing through the spine and bisecting the whole thorax; S, spine; V, intersection of a line through the centroid of the ventricular septum and the thoracic circumference on the side closer to the ventricular septum, used as a substitute for the apex. The extended purple line passes through S and V.
2.5.3. Point P
Point P was originally defined as the intersection of the atrial septum and the atrial wall [36]. To approximate its location in the present study, point P (p_x, p_y) was defined as the intersection between line IV and the segmented cardiac circumference on the side opposite the ventricular septum. The centroid of the whole thorax was defined as point O′ (c_x, c_y), and the x′ and y′ axes were centered at point O′. The probability that the coordinates of point P were within (0 ≤ p_x ≤ 2, −1 ≤ p_y ≤ 1) was approximately 96.3% (Figure 5) [36]. The points at which the relative coordinate axes x′ and y′ intersect the circumference of the whole thorax are defined as B1 and B2, and A1 and A2, respectively. The relative coordinates are constructed as follows:
① Point O′ (c_x, c_y) is moved to the origin O (0, 0) of the relative coordinate axes X and Y. Point P (p_x, p_y) is moved in the same manner.
② Because the x′ and y′ axes are tilted by an angle θ owing to the tilt of the thorax, they are rotated to the horizontal.
③ The y′ axis is reversed to the negative direction.
④ Lines B1B2 and A1A2 of the relative coordinate axes are divided into eight equal parts, and the scale is adjusted accordingly.
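The four steps for constructing the relative coordinates can be sketched as a translation, a rotation, a sign flip, and a rescaling. Several details are assumed for illustration and are not confirmed by the text: that B1B2 lies along the x′ axis and A1A2 along the y′ axis, that one unit of the relative scale equals one eighth of the corresponding chord length, that θ is measured in degrees in the image coordinate frame, and the function names.

```python
import math


def to_relative_coords(p, centroid, theta_deg, len_b1b2, len_a1a2):
    """Map point P from image coordinates to thorax-relative coordinates."""
    # Step 1: translate so that the thoracic centroid O' becomes the origin.
    dx = p[0] - centroid[0]
    dy = p[1] - centroid[1]
    # Step 2: rotate by -theta to undo the tilt of the thorax.
    t = math.radians(theta_deg)
    x = dx * math.cos(t) + dy * math.sin(t)
    y = -dx * math.sin(t) + dy * math.cos(t)
    # Step 3: reverse the direction of the y axis.
    y = -y
    # Step 4: rescale so that each chord spans eight units (assumed reading).
    return x / (len_b1b2 / 8.0), y / (len_a1a2 / 8.0)


def point_p_in_normal_range(px, py):
    """Normal range stated in the text: 0 <= p_x <= 2 and -1 <= p_y <= 1."""
    return 0.0 <= px <= 2.0 and -1.0 <= py <= 1.0
```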
The normal reference ranges were defined according to the Practice Guidelines of the International Society of Ultrasound in Obstetrics and Gynecology: CTAR < 35%, cardiac axis 45 ± 20° toward the left thorax, and point P evaluated using the relative coordinate system centered at the thoracic centroid, as described above [32,36,37]. We assessed whether point P lay outside the normal coordinate range. For the primary endpoint, parameter values obtained from the AI models were compared with those measured by obstetricians to determine equivalence.
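The reference ranges above translate into a simple per-image screening rule. The sketch below flags an image when any parameter falls outside its range, mirroring the "at least one of three parameters" criterion used in the results; the function name and dictionary keys are hypothetical, the cardiac axis range is taken as 25–65° (45 ± 20°), and the direction of the apex (left thorax) is not modeled here.

```python
def screening_flags(ctar_percent, axis_deg, point_p_normal):
    """Return per-parameter abnormality flags and an overall screening result.

    `point_p_normal` is a boolean stating whether point P lies inside its
    normal coordinate range.
    """
    flags = {
        "ctar": ctar_percent >= 35.0,                 # normal: CTAR < 35%
        "cardiac_axis": not (25.0 <= axis_deg <= 65.0),  # normal: 45 +/- 20 degrees
        "point_p": not point_p_normal,
    }
    flags["screen_positive"] = any(flags.values())
    return flags
```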
2.6. Performance Evaluation of the AI Models
For automatic image extraction using YOLOv7, videos were considered unsuitable for evaluation if the 4CV could not be recognized and extracted. First, image extraction was performed on 35 normal fetal cardiac ultrasound videos from 33 patients acquired using Voluson® E8 or E10. Second, 10 normal fetal cardiac ultrasound videos from 10 patients acquired using the newer Voluson® Expert 22 were analyzed to evaluate model generalizability across devices. Each extracted image was segmented using UNet 3+ and SegFormer, and the mDice and other parameters were calculated from the segmentation results. Because only normal cases were used to train the AI models, both the internal and external validation datasets consisted exclusively of normal cases.
To evaluate model performance in patients with CHD, 22 fetal cardiac ultrasound videos from 22 patients acquired using Voluson® E8, E10, or Expert 22 were analyzed. The diagnoses included tetralogy of Fallot (TOF), Ebstein’s anomaly, hypoplastic left heart syndrome, pulmonary atresia with an intact ventricular septum, tricuspid atresia, transposition of the great arteries, congenital cystic adenomatoid malformation, congenital diaphragmatic hernia, double-outlet right ventricle, ventricular septal defect, and RAA. As in the normal cohort, YOLOv7 automatically and randomly extracted two 4CV images from each video. Videos from which no 4CV image could be extracted were classified as abnormal. The extracted images were segmented using UNet 3+ and SegFormer, and all parameters were subsequently calculated.
2.7. Clinical Comparison Study
A comparative study between the AI models and 22 obstetricians was conducted at Showa Medical University Hospital and Showa Medical University Koto Toyosu Hospital to determine whether parameter errors were clinically acceptable. The obstetricians were divided into three groups: experts (n = 3; 19, 24, and 31 years of experience), fellows (n = 7; 6–11 years of experience), and residents (n = 12; 3–5 years of experience). All participants were provided with fetal ultrasound videos acquired using Voluson® E8, E10, or Expert 22, including 10 normal cases and 10 cases with CHD. Each obstetrician was asked to extract one thoracic image from each video that contained the optimal 4CV for measurement. To replicate the clinical workflow, each obstetrician outlined the whole thorax and heart on the extracted image using the elliptical method, drew one straight line through the spine bisecting the thoracic area and another bisecting the ventricular septum, and marked point P [37,38]. The calculation algorithm then automatically derived the CTAR and cardiac axis. Point P was visually assessed for whether it lay within the normal range using the automatically generated thorax-based coordinate system. After collecting all responses, each parameter was calculated within the program.
The AI results for UNet 3+ and SegFormer were computed using two images extracted from each of the same 20 videos evaluated by the obstetricians. To assess screening performance, the results of these two models were compared with those obtained by the obstetricians. Model performance was summarized using ROC curves. After homogeneity of variances was tested using Levene’s test, one-way analysis of variance (ANOVA) or Welch’s ANOVA was applied to test for significant differences between the obstetricians’ and AI models’ parameter values, with significance set at p < 0.05. When significant differences were detected, Tukey’s honestly significant difference test (for homoscedastic data) or the Games–Howell test (for heteroscedastic data) was performed.
3. Results
3.1. Model Structure and Internal Validation
The testing dataset comprised 58 images automatically extracted from normal fetal cardiac ultrasound videos using YOLOv7. Images that could not be segmented automatically because of poor image quality were excluded from the mDice calculation for each structure. The mDice values for the heart, ventricular septum, whole thorax, thorax, and descending aorta were 0.923, 0.783, 0.949, 0.946, and 0.658, respectively, for UNet 3+ and 0.928, 0.776, 0.951, 0.949, and 0.690, respectively, for SegFormer (Table 1). For UNet 3+, the mean predictions ± standard deviation (SD) were 26.9 ± 3.5% for CTAR and 41.7 ± 9.8° for the cardiac axis, with mean absolute errors (MAEs) of 2.7% and 5.4°, respectively, relative to the ground truth. For SegFormer, the corresponding values were 27.1 ± 3.5% for CTAR and 41.6 ± 10.0° for the cardiac axis, with MAEs of 2.7% and 5.6°, respectively (Table 2). Supplementary Figure S1 shows the distribution of point P, with 90.0–95.6% of measurements falling within the normal range in the testing dataset. The performance of UNet 3+ and SegFormer was comparable.
3.2. Evaluation of Differences Between Ultrasound Equipment Using External Validation Dataset
For external validation to assess differences between ultrasound equipment, we analyzed 18 images automatically extracted with YOLOv7 from normal fetal cardiac ultrasound videos acquired using a different ultrasound machine. As shown in Table 3, the mDice values for the heart, ventricular septum, whole thorax, thorax, and descending aorta were 0.922, 0.740, 0.941, 0.950, and 0.758, respectively, for UNet 3+ and 0.931, 0.721, 0.945, 0.944, and 0.795, respectively, for SegFormer. For biometric parameter estimation in the external validation dataset, UNet 3+ yielded values of 26.5 ± 3.9% for CTAR and 37.7 ± 14.3° for the cardiac axis, with MAEs of 2.2% and 5.0°, respectively, relative to the ground truth. SegFormer yielded values of 27.1 ± 3.8% for CTAR and 38.8 ± 13.6° for the cardiac axis, with MAEs of 2.7% and 5.0°, respectively (Table 4).
3.3. Evaluation Results of Images of Patients with CHD
In patients with CHD, 30 images were extracted from 15 patients; in the remaining 7 of the 22 patients, no frame was recognized as a normal 4CV, and therefore no image could be extracted. Of these 30 images, UNet 3+ classified 19 images and SegFormer classified 20 images as abnormal based on abnormalities in at least one of three parameters: CTAR, cardiac axis, or point P. Patients without detected abnormalities showed morphological abnormalities confined to the heart or the vascular system. Across all 22 CHD cases, the per-patient sensitivity, counting patients with non-extractable 4CV images as screening positive, was 0.773 (17/22) for UNet 3+ and 0.818 (18/22) for SegFormer. Figure 6 shows the results of automated 4CV assessment in three patients with CHD: pulmonary atresia with an intact ventricular septum (a), transposition of the great arteries (b), and TOF (c). RAA complicates approximately 40% of patients with TOF. In this study, patients with RAA in whom the descending aorta and cardiac apex were located in opposite thoracic regions (referred to here as “inversus”) were classified as abnormal [38,39].
3.4. Clinical Comparison Study Between Obstetricians and AI Models
We compared biometric parameters derived by obstetricians and the AI models with the ground-truth labels. Supplementary Figure S2 illustrates variation in 4CV image extraction by obstetricians and YOLOv7. Experts and YOLOv7 tended to extract similar images, whereas residents showed greater variability in image selection. In 10 normal cases, the mean ± SD values for experts, fellows, and residents were 24.5 ± 4.1%, 22.5 ± 5.2%, and 22.7 ± 5.0%, respectively, for the CTAR and 41.8 ± 7.6°, 42.9 ± 11.0°, and 40.8 ± 14.7°, respectively, for the cardiac axis (Figure 7 and Table 5). For UNet 3+ and SegFormer, CTAR was 27.2 ± 3.4% and 27.3 ± 3.2%, respectively, and the cardiac axis was 40.2 ± 7.6° and 40.6 ± 7.3°, respectively. For point P, the proportion of incorrect cases in the 10 normal cases is shown in Table 5, with a maximum value of 11.7%. Compared with the labels, CTAR tended to be smaller for obstetricians and larger for the AI models. However, the Games–Howell test showed no significant differences between the label and any group in the normal cases. No significant differences were observed for the cardiac axis (Welch’s ANOVA, p = 0.617) (Supplementary Table S2).
3.5. AI Models Achieve a Screening Performance Equivalent to That of Experts
Figure 8 and Supplementary Figure S3 present ROC curves illustrating the screening performance of experts, fellows, residents, and the AI models. The area under the curve (AUC) analysis indicated that the CTAR and cardiac axis contributed more to screening performance than point P. Screening based on CTAR and the cardiac axis achieved AUC values of 0.816 (95% CI, 0.699–0.913) for experts, 0.835 (95% CI, 0.775–0.893) for UNet 3+, and 0.851 (95% CI, 0.789–0.904) for SegFormer. When the CTAR, cardiac axis, and point P were combined, the AUC values increased to 0.860 (95% CI, 0.754–0.943), 0.841 (95% CI, 0.778–0.897), and 0.861 (95% CI, 0.804–0.910) for experts, UNet 3+, and SegFormer, respectively. These results indicate that the AI models achieved performance comparable to that of experts. SegFormer achieved an AUC nearly identical to that of the experts (0.861 vs. 0.860), and its ROC curve intersected that of the experts (Figure 8 and Table 6). Using the Youden index for experts, the clinically acceptable false-positive rate was 0.300. At this operating point, sensitivity was 0.867 for experts and 0.816 for SegFormer. These findings suggest that experts achieved higher sensitivity, whereas SegFormer achieved higher specificity.
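For readers less familiar with the ROC quantities reported here, the sketch below shows how ROC points, the AUC (by the trapezoidal rule), and the Youden index can be computed from per-case screening scores and binary CHD labels. It is a generic pure-Python illustration with hypothetical function names, not the analysis code used in the study, and it does not reproduce the confidence intervals.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) points, lowest first.

    A case is called positive when its score is >= the threshold; `labels`
    holds 1 for abnormal and 0 for normal cases.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def youden_index(points):
    """Maximum of sensitivity + specificity - 1, i.e. max(TPR - FPR)."""
    return max(tpr - fpr for fpr, tpr in points)
```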
4. Discussion
The skill level of examiners does not improve rapidly, and continuous training is required for fetal cardiac ultrasound examinations [26,39]. This study was designed to simulate a clinical scenario in which an examiner can perform basic prenatal screening using an AI model. Several studies have reported that CNN models can classify individual cross-sectional frames extracted from cardiac sweep videos and detect abnormalities [27,40,41]. However, in those studies, diagnoses were not made by physicians, and even when abnormalities were detected, their underlying causes were not clearly defined. In contrast, Liang et al. applied the CTAR and cardiac axis using segmentation methods [42], and Taksøe-Vester et al. developed a segmentation-based AI model for screening fetal coarctation of the aorta [43]. In addition, AI-based automated assessment of the pulmonary artery–to–ascending aorta ratio in the 3VV has been reported [33]. Furthermore, a novel approach using three-dimensional segmentation models has been proposed for prenatal ultrasound screening of total anomalous pulmonary venous connection [44].
In this study, biometric parameter calculations were designed to align as closely as possible with the CTAR and cardiac axis measurement methods used in clinical practice, thereby facilitating acceptance of the AI system by potential users. To address the domain-shift problem, variability in ultrasound devices and gestational age was restricted. Accordingly, this study verified compatibility between a newer ultrasound device and a widely used model from the same vendor. In the automated 4CV assessment model, the screening workflow comprised the following steps: (1) automatic detection and extraction of the 4CV using YOLOv7; (2) calculation of biometric parameters, including the CTAR, cardiac axis, and point P; (3) confirmation of fetal position (cephalic or breech); and (4) confirmation of the positions of the descending aorta and cardiac apex (solitus or inversus). Examiners can screen for 4CV morphological abnormalities during the initial YOLOv7-based detection and through subsequent identification of abnormal parameter values. Screening results can be shared efficiently between examiners and experts, which may accelerate the initial evaluation and facilitate timely referral for secondary examination. Final determination of disease-relevant abnormalities, however, remains the responsibility of a skilled examiner.
We compared UNet 3+ and SegFormer and found that both achieved comparable performance in terms of mDice and parameter estimation accuracy. With respect to network architecture and computational efficiency, UNet 3+ has fewer parameters and faster inference speed, whereas SegFormer has lower computational complexity. Both architectures were selected because they enable accurate extraction of spatial relationships during feature learning; however, segmentation performance alone did not allow a definitive determination of superiority. Nevertheless, the clinical comparison study suggested that SegFormer performed slightly better than UNet 3+. For both the CTAR and cardiac axis, the MAEs of the AI models were similar to those of obstetricians, and these errors were considered clinically acceptable. Regarding the range of these parameters in normal cases, no significant differences were observed among the labels, experts, or AI models. In addition, the AI models showed accuracy comparable to that of a related study (CTAR: 31.0 ± 4.0%; cardiac axis: 32.0 ± 8.2°) [42]. Point P is rarely used in routine clinical practice because of the difficulty of manual measurement; however, the availability of computational resources and AI assistance may enable more frequent use of this parameter in the future. The clinical comparison study indicated that fetal cardiac ultrasound screening based on the CTAR, cardiac axis, and point P achieved performance comparable to that of experts. In addition, this study demonstrated greater variability in 4CV extraction among residents than among experts, suggesting that AI-assisted biometric parameter calculation and standardized 4CV extraction may particularly benefit less experienced examiners. Our fully automated AI models are expected to reduce missed abnormalities and standardize screening accuracy, which may in turn improve prenatal diagnostic rates.
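For readers less familiar with the two metrics compared above, the following sketch shows one common way to compute a mean Dice coefficient over segmentation classes and a mean absolute error over parameter estimates. It is illustrative only: the function names are hypothetical, and whether the mDice reported in this study includes a background class or averages per image is not stated in this section, so the choice of `class_ids` is an assumption left to the caller.

```python
import numpy as np


def dice(pred, target):
    """Dice coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Both masks empty: treat as perfect agreement (a convention).
        return 1.0
    return 2.0 * np.logical_and(pred, target).sum() / denom


def mean_dice(pred_labels, true_labels, class_ids):
    """Average Dice over the given class ids of two integer label maps."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    scores = [dice(pred_labels == c, true_labels == c) for c in class_ids]
    return float(np.mean(scores))


def mae(estimates, references):
    """Mean absolute error between estimated and reference values."""
    diff = np.asarray(estimates, dtype=float) - np.asarray(references, dtype=float)
    return float(np.mean(np.abs(diff)))


# Toy label maps with two foreground classes (1 and 2).
true_map = np.array([[1, 1, 0, 0], [2, 2, 0, 0]])
pred_map = np.array([[1, 0, 0, 0], [2, 2, 0, 0]])
print(mean_dice(pred_map, true_map, class_ids=[1, 2]))
print(mae([30.0, 45.0], [32.0, 44.0]))
```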
Although increasing the number of training images can improve model accuracy, identification of the ventricular septum and descending aorta remains challenging in low-quality images, which limits overall performance. Moreover, the fundamental method of fetal ultrasound examination is unlikely to change without major technological advances in ultrasound equipment. Automated 4CV assessment systems, including the AI models proposed here, therefore remain highly dependent on examiner performance. Images that the AI models analyze easily are generally also easy for humans to interpret. Consequently, continued efforts to improve examiner skill in acquiring high-quality images remain essential. Higher-quality training data are expected to further improve AI performance.
This study has some limitations. First, although accuracy was verified for the device used in this study and its newer models, measurement accuracy may decline when ultrasound devices from other vendors or devices with lower image quality are used. Second, because the training data consisted of fetal cardiac ultrasound videos acquired between 18 and 28 weeks of gestation, the generalizability of the models to other gestational ages requires further validation. Fetal cardiac structures are smaller and less clear in the first trimester, making analysis more difficult. By contrast, the structures become clearer in the third trimester, when screening performance approaches that of a diagnostic examination. To extend evaluation to the first and third trimesters for clinical applications, future studies should explore new AI models or multimodal methods. Third, interobserver and intraobserver variability in the segmentation labels was not assessed, which may have introduced bias into both the mDice and the MAE. Fourth, because images for the CTAR were randomly extracted from videos, systolic and diastolic frames were analyzed together. Fifth, the validation and test results were based on the average of two extracted images; however, the optimal number of images required for this approach remains to be determined. With continued advances in AI, automated systems are expected to extract appropriate cross-sectional views from multiple sites with fewer probe movements and to compute multiple parameters automatically. Finally, the fully automated AI models were built on existing standard architectures. Future work should explore architectural modifications that avoid potential patent infringement during clinical deployment. In addition, ablation studies should be performed when updating the architecture, using state-of-the-art YOLO-series or segmentation models.
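The fifth limitation, averaging a parameter over a small number of extracted images, can be explored with a simple resampling experiment. The sketch below is a hypothetical illustration rather than an analysis performed in this study: it averages frame-level estimates and measures how much a k-frame average varies across random subsets of frames, which is one way to judge how many frames are worth averaging. The function names, the example values, and the use of random subsets are all assumptions.

```python
import numpy as np


def mean_of_frames(frame_values):
    """Average a biometric parameter over the extracted frames."""
    return float(np.mean(np.asarray(frame_values, dtype=float)))


def spread_of_k_frame_mean(frame_values, k, n_draws=500, seed=0):
    """Standard deviation of the k-frame average over random frame subsets."""
    values = np.asarray(frame_values, dtype=float)
    if not 1 <= k <= values.size:
        raise ValueError("k must be between 1 and the number of frames")
    rng = np.random.default_rng(seed)
    # Draw k distinct frames each time and record the resulting average.
    means = [rng.choice(values, size=k, replace=False).mean()
             for _ in range(n_draws)]
    return float(np.std(means))


# Made-up frame-level CTAR estimates (%) from one video.
ctar_frames = [28.0, 30.0, 31.0, 33.0, 35.0]
print(mean_of_frames(ctar_frames[:2]))
for k in (1, 2, 4):
    print(k, spread_of_k_frame_mean(ctar_frames, k))
```

A smaller spread at larger k indicates a more stable estimate, at the cost of requiring more frames of adequate quality from each video.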
5. Conclusions
We developed fully automated AI models that extract 4CV images from fetal cardiac ultrasound videos and quantify the CTAR, cardiac axis, and point P. In the clinical comparison study, screening performance based on all three biometric parameters achieved AUC values of 0.860, 0.841, and 0.861 for experts, UNet 3+, and SegFormer, respectively. These models are expected to reduce missed abnormalities and to improve the standardization of examination accuracy. In a clinical scenario, the models could support fetal cardiac ultrasound screening, with the final diagnosis made by an obstetrician when abnormal parameter values are identified. The integration of AI is also expected to enable more accurate assessments through the use of higher-quality images and to encourage examiners to acquire optimal input images. Consequently, the skill levels of examiners using these models may both improve and become more consistent.
References
- 1. Lee L.H., Bradburn E., Craik R., Yaqub M., Norris S.A., Ismail L.C., Ohuma E.O., Barros F.C., Lambert A., Carvalho M. Machine learning for accurate estimation of fetal gestational age based on ultrasound images. npj Digit. Med. 2023, 6, 36. doi:10.1038/s41746-023-00774-2. PMID: 36894653. PMCID: PMC9998590.
- 2. He F., Li G., Zhang Z., Yang C., Yang Z., Ding H., Zhao D., Sun W., Wang Y., Zeng K. Transfer learning method for prenatal ultrasound diagnosis of biliary atresia. npj Digit. Med. 2025, 8, 131. doi:10.1038/s41746-025-01525-1. PMID: 40021764. PMCID: PMC11871324.
- 3. Meng Q., Sinclair M., Zimmer V., Hou B., Rajchl M., Toussaint N., Oktay O., Schlemper J., Gomez A., Housden J. Weakly Supervised Estimation of Shadow Confidence Maps in Fetal Ultrasound Imaging. IEEE Trans. Med. Imaging 2019, 38, 2755–2767. doi:10.1109/TMI.2019.2913311. PMID: 31021795. PMCID: PMC6892638.
- 4. Campanella G., Hanna M.G., Geneslaw L., Miraflor A., Werneck Krauss Silva V., Busam K.J., Brogi E., Reuter V.E., Klimstra D.S., Fuchs T.J. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 2019, 25, 1301–1309. doi:10.1038/s41591-019-0508-1. PMID: 31308507. PMCID: PMC7418463.
- 5. Yamada Y., Kobayashi M., Shinkawa K., Bilal E., Liao J., Nemoto M., Ota M., Nemoto K., Arai T. Utility of synthetic musculoskeletal gaits for generalizable healthcare applications. Nat. Commun. 2025, 16, 6188. doi:10.1038/s41467-025-61292-1. PMID: 40615372. PMCID: PMC12227639.
- 6. Rao V.M., Hla M., Moor M., Adithan S., Kwak S., Topol E.J., Rajpurkar P. Multimodal generative AI for medical image interpretation. Nature 2025, 639, 888–896. doi:10.1038/s41586-025-08675-y. PMID: 40140592.
- 7. You J., Zhang S., Zhang J., Chen Y., Zhang M., Zhou C., Jiang B. Ensemble learning for predicting microsatellite instability in colorectal cancer using pretreatment colonoscopy images and clinical data. Front. Oncol. 2025, 15, 1734076. doi:10.3389/fonc.2025.1734076. PMID: 41551159. PMCID: PMC12807959.
- 8. Khullar V., Abbas M., Kansal I., Ksibi A., Gupta G., Gupta D., Juneja S., Nauman A. Low resource federated learning for classification of nail disease by deploying cross-silo and heterogeneously dataset distributions. Sci. Rep. 2026, 16, 7707. doi:10.1038/s41598-026-36848-w. PMID: 41654538. PMCID: PMC12946191.
