LungSurg: A Generative AI System for Segmentation and Phase Classification in Thoracoscopic Lobectomy

Hengrui Liang; Zeping Yan; Yudong Zhang; Keyao Dai; Hongyan Li; Jianfei Shen; Pengfei Li; Jipeng Jiang; Guochao Zhang; Xiang Zhang; Hao Chen; Honglang Zhang; Yuzhuo Zhang; Shujun Liang; Minsheng Chen; Xin Wang; Anyi Rao; Wei Wang; Lei Zhao; Yuchen Guo; Jianxing He

PMC · DOI:10.1002/mco2.70613·January 20, 2026

LungSurg: A Generative AI System for Segmentation and Phase Classification in Thoracoscopic Lobectomy

Hengrui Liang, Zeping Yan, Yudong Zhang, Keyao Dai, Hongyan Li, Jianfei Shen, Pengfei Li, Jipeng Jiang, Guochao Zhang, Xiang Zhang, Hao Chen, Honglang Zhang, Yuzhuo Zhang, Shujun Liang, Minsheng Chen, Xin Wang, Anyi Rao, Wei Wang, Lei Zhao, Yuchen Guo, Jianxing He

PDF

Open Access

TL;DR

LungSurg is an AI system that helps in lung cancer surgery by identifying anatomical structures and surgical phases, improving accuracy and training outcomes.

Contribution

LungSurg introduces a novel AI system for VATS lobectomy with high anatomical sensitivity and educational benefits for surgical residents.

Findings

01

LungSurg achieved high segmentation accuracy for left and right lungs in VATS lobectomy.

02

The classification network showed strong Top-1 and Top-3 phase recognition accuracy.

03

Residents trained with LungSurg improved their surgical skills more than those using conventional methods.

Abstract

The integration of artificial intelligence (AI) into surgical practices is advancing towards greater intelligence and precision. This study assesses the potential of AI in video‐assisted thoracoscopic surgery (VATS) lobectomy for lung cancer by developing an AI system named LungSurg. LungSurg comprises two interconnected networks: a segmentation network for identifying intrathoracic anatomy and surgical instruments, and a classification network for recognizing surgical phases. We prospectively collected 222 VATS lobectomy videos from eight centers, generating over 32,000 annotations and more than one million frames with phase information. In external validation, the segmentation network achieved mean Average precision scores of 0.745 for the left lung and 0.726 for the right lung across various instruments and anatomical structures. The classification network demonstrated Top‐1 and…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases1

lung cancer

Figures6

Click any figure to enlarge with its caption.

The study design and the establishment of the artificial intelligence model. (A) The workflow and data requirements for the LungSurg. (B) The whole system architecture of the LungSurg. ①SurgSeg’s judgment confidence for each pixel of the image; ② the result of the dot product between the original image and the segmentation mask given by SurgSeg; ③ image segmentation features generated by the encoder of SurgSeg; ④ image features generated by the encoder of the pre‐trained network.

The representative annotations and predictions of all labels in different surgical structures.The distribution of the annotated segmentation and the proportion of each category in surgical phase. (A) Annotation in training set of segmentation; B: Annotation and prediction in validation set of segmentation. (C) Annotation in training set of phase Classification; D: Prediction in validation set of phase Classification.

The representative annotations and predictions of all labels in different surgical structures.

The segmentation performance of the LungSurg in the overall, the left and the right Video‐assisted Thoracoscopic Lobectomy in the external validation set. (A) mAP bar chart for single‐center vs. multi‐center trained models on external validation set. (B) PRC (whole, instrument, anatomy) for SurgSeg (right lung) results. (C) PRC (whole, instrument, anatomy) for SurgSeg (left lung) results. The “R” refers to the right pulmonary lobectomy, and the “L” indicates the left side. mAP, mean Average Precision; PRC, Precision‐Recall Curve.

Visualization and expert‐AI comparison of the LungSurg Model. (A) Phase Classification Result Compared to Ground Truth in a Color‐Coded Ribbon Illustration. The Horizontal Axis Represents Time Progression During Surgery. The Upper Ribbon is the Ground Truth, and the Lower is the Predicted Phase. Different Colors Represent Different Phases Corresponding to Figure 2C and 2D. (B) Designed Software Showing Segmentation and Phase Classification Predicted by LungSurg in Real Surgery. The colored outlines represent the contours of the segmentation masks generated by SurgSeg. (C) Expert‐ LungSurg comparison. AI, Artificial Intelligence; PIC, Precision of Instance Classification; PID, Precision of Instance Detection; SID, sensitivity of instance detection.

Demonstration of the annotated surgical structures and surgical phase involved in the Video‐assisted Thoracoscopic Lobectomy. (A) Presentation of the instruments. a Electrocoagulation; b Ultrasonic knife; c Stapler; d Suction; e Clamp. (B) Presentation of the thoracic anatomy and surgical areas relative to lobectomy with different phases. LLL, left lower lobe; LN, lymph node; LUL, right upper lobe; PA, pulmonary artery; PV, pulmonary vein; RLL, right lower lobe; RML, right middle lobe; RUL, right upper lobe.

Tables2

TABLE 1. The overall segmentation performance of the LungSurg in internal and external validation sets.

	Internal validation		External validation
Classes	Left	Right	Left	Right
Overall	0.816	0.780	0.745	0.726
Overall instruments	0.937	0.902	0.806	0.755
Electrocoagulation	0.972	0.973	0.864	0.797
Ultrasonic knife	0.948	0.950	0.777	0.719
Stapler	0.909	0.833	0.868	0.762
Suction	0.955	0.921	0.788	0.756
Clamp	0.903	0.838	0.700	0.688
Overall anatomy	0.724	0.691	0.670	0.679
Common features
Pulmonary artery (PA)	0.710	0.653	0.607	0.615
Pulmonary vein (PV)	0.742	0.678	0.601	0.732
Bronchus (B)	0.615	0.671	0.667	0.575
Lymph node (LN)	0.694	0.645	0.777	0.751
Right side
Vena azygos	—	0.810	—	0.693
Left side
aorta	0.858	—	0.723	—

TABLE 2. The overall classification performance of the LungSurg in internal and external validation sets.

	Internal validation		External validation
	Top 1	Top 3	Top 1	Top 3
Overall	0.760	0.929	0.715	0.880
Wedge resection	0.642	0.900	0.625	0.910
Mobilize the lung by dissecting ligament/adhesion band	0.701	0.902	0.700	0.856
Management of the pulmonary vein	0.894	0.993	0.796	0.988
Management of the fissure	0.745	0.954	0.770	0.950
Management of the pulmonary artery	0.791	0.998	0.785	0.804
Management of the lymph nodes	0.822	0.983	0.702	0.964
Operative field management	0.932	0.995	0.909	0.995
Management of the bronchus	0.872	0.951	0.678	0.810
Resection the targeted lung lobe	0.681	0.933	0.763	0.879
Leakage examination/operative field cleanness	0.737	0.897	0.785	0.900
Nerve block	0.608	0.902	0.539	0.870
Extrathoracic scenes	0.903	0.939	0.755	0.848
Waiting	0.671	0.776	0.661	0.803
Lung Inflation for targeted bronchus identification	0.646	0.874	0.544	0.737

Funding3

—Grant R&D Program of Guangzhou National Laboratory
—2025 Engineering Science and Technology Academic Symposium Project of the Chinese Academy of Engineering
—National Natural Science Foundation of China10.13039/501100001809

Keywords

deep learninggenerative artificial intelligencelobectomysurgical educationthoracic surgery

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsLung Cancer Diagnosis and Treatment · AI in cancer detection · COVID-19 diagnosis using AI

Full text

Introduction

1

Lung cancer remains the leading cause of cancer‐related mortality worldwide, with approximately 2.2 million new cases and 1.8 million deaths annually [1]. For early‐stage non‐small cell lung cancer (NSCLC), video‐assisted thoracoscopic surgery (VATS) lobectomy with systematic lymph node dissection is the standard treatment, offering oncological outcomes comparable to open thoracotomy but with benefits such as reduced postoperative pain, shorter hospital stays, and faster recovery [2, 3].

Mastering VATS lobectomy poses considerable challenges due to its technical complexity and the intricate anatomy of the thoracic cavity. Surgeons require a substantial number of cases to achieve proficiency, highlighting a steep learning curve [4]. For surgical residents, the initial phase of training involves acquiring comprehensive anatomical knowledge and understanding the sequential steps of the operation [5]. However, limited access to real‐time mentorship and detailed educational resources often hampers this learning process, underscoring the need for innovative training tools [6].

Artificial intelligence (AI) has emerged as a promising solution to enhance surgical training and intraoperative performance [7]. In less complex procedures like laparoscopic cholecystectomy and cataract surgery, video‐based AI systems have successfully segmented anatomy, identified instruments, and classified procedural steps, contributing to improved surgical outcomes [8, 9, 10]. Our team has previously introduced innovative AI solutions: SurgSmart, an AI system for quality control [11]; Surgesture, offering objective skill assessment through novel instrumentation [12]; and algorithms for objective skill classification based on surgical gestures [13]. However, applying AI to complex procedures like VATS lobectomy remains underexplored due to high anatomical variability and intraoperative factors such as occlusion by instruments, smoke, and tissue deformation [14].

Advanced techniques like image inpainting–which uses generative learning to reconstruct missing or occluded image parts–can enhance AI's interpretation of complex surgical fields [15]. In thoracic surgery, this approach restores visual continuity when key structures are obscured, potentially improving segmentation and phase classification in VATS lobectomy.

In this study, we developed LungSurg, an AI system tailored for VATS lobectomy. LungSurg integrates image inpainting to address occlusion and anatomical variability, enabling precise segmentation of anatomical structures and surgical instruments, as well as accurate procedural phase classification throughout the operation. We validated LungSurg in a multicenter cohort, demonstrating its feasibility and effectiveness. By applying advanced generative AI techniques, our work aims to lay the foundation for future applications in automated quality assessment and enhanced surgical training, which may ultimately contribute to improved surgical practice.

Results

2

Video Characteristics and Annotations

2.1

In this multicenter study, involving eight centers, we analyzed a total of 222 VATS lobectomy videos encompassing complete surgical procedures–from endoscope insertion to wound closure (Figure 1A). These videos included all lung lobes (right upper lobe [RUL], right middle lobe [RML], right lower lobe [RLL], left upper lobe [LUL], and left lower lobe [LLL]) and were performed by senior thoracic surgeons, ensuring a comprehensive representation of surgical practices across different institutions.

The study design and the establishment of the artificial intelligence model. (A) The workflow and data requirements for the LungSurg. (B) The whole system architecture of the LungSurg. ①SurgSeg’s judgment confidence for each pixel of the image; ② the result of the dot product between the original image and the segmentation mask given by SurgSeg; ③ image segmentation features generated by the encoder of SurgSeg; ④ image features generated by the encoder of the pre‐trained network.

The dataset was divided into a training set comprising 165 videos (Table S1a) from four centers and a validation set of 57 videos (Table S1b). After frame extraction at a rate of one frame per second (fps), approximately 1.5 million frames were generated for analysis. A total of 32,027 surgical structures were manually annotated, averaging about 144 annotations per video. The distribution of annotations was consistent between the training and validation sets, as well as predicted labels, providing a diverse and balanced dataset for model development (Figure 2A,B).

The representative annotations and predictions of all labels in different surgical structures.The distribution of the annotated segmentation and the proportion of each category in surgical phase. (A) Annotation in training set of segmentation; B: Annotation and prediction in validation set of segmentation. (C) Annotation in training set of phase Classification; D: Prediction in validation set of phase Classification.

All frames were labeled with corresponding phase information, and a total of 14 definitive surgical phases were identified (Table S1, Figure S1). The duration of each phase varied due to differences in surgical techniques and patient anatomy. Core surgical steps, such as the management of bronchovascular structures, lymphadenectomy, fissure dissection, and lobe resection, accounted for the largest proportion of surgical time (Figure 2C,D).

Interobserver Agreement on Annotations

2.2

To ensure high‐quality and consistent annotations, we employed a double‐check method complemented by validation from a third expert. Six certified thoracic surgeons were divided into two groups, with videos randomly allocated in a 1:1 ratio. Within each group, surgeons collaborated to reach consensus on annotations for each video. The annotation consistency in the segmentation task was assessed using metrics such as positive identification rate (PID), positive identification confidence (PIC), and sensitivity identification rate (SID), yielding values of 0.952, 0.992, and 0.971, respectively. For the surgical phase classification task, the frame‐level agreement Cohen's kappa score was 0.948 overall, with 0.942 for right lobectomy and 0.951 for left lobectomy. These high agreement scores indicate excellent consistency in annotation quality across centers.

Performance of LungSurg on the Segmentation Task

2.3

The distribution of predicted labels closely matched the annotated ground truth in the validation set (Figure 2B). Visual comparisons of the model's predictions with manual annotations confirmed the high segmentation performance (Figure 3).

The representative annotations and predictions of all labels in different surgical structures.

In the external validation set from three centers, the segmentation network (SurgSeg) of LungSurg demonstrated robust performance across all structure categories. The mean average precision (mAP) at an intersection over union (IoU) threshold of 0.5 was 0.745 for the left lung and 0.726 for the right lung for all structures (Table 1). Specifically, the mAP of anatomical structures was 0.670 for the left lung and 0.679 for the right lung. Regarding surgical instruments, the mAPs were 0.806 for the left lung and 0.755 for the right lung (Table 1). Major anatomical structures, such as pulmonary vessels and bronchi, achieved mAP values ranging from 0.575 to 0.777, indicating accurate segmentation despite anatomical variability and occlusions (Table 1).

Figure 4A and Table S2 demonstrate that the model trained on the multi‐center dataset outperforms the single‐center model in both anatomical and instrument segmentation (mAP). Figure 4B further indicates that the multi‐center trained model significantly outperforms the single‐center model. Figure S2 and Table S3 illustrate the model's segmentation performance on surgical anatomy and instruments across three different external validation centers, demonstrating a high degree of visual consistency.

The segmentation performance of the LungSurg in the overall, the left and the right Video‐assisted Thoracoscopic Lobectomy in the external validation set. (A) mAP bar chart for single‐center vs. multi‐center trained models on external validation set. (B) PRC (whole, instrument, anatomy) for SurgSeg (right lung) results. (C) PRC (whole, instrument, anatomy) for SurgSeg (left lung) results. The “R” refers to the right pulmonary lobectomy, and the “L” indicates the left side. mAP, mean Average Precision; PRC, Precision‐Recall Curve.

To evaluate the impact of algorithmic improvements, we conducted an ablation study (Table S4) and plotted PRCs for 10‐fold cross‐validation (Figure S3). The baseline ResNet‐50 model attained mAP values of 0.766 on the left and 0.672 on the right. Integrating the Feature Pyramid Network (FPN) improved these results to 0.768 and 0.685, respectively, while further adding non‐local (NL) modules raised them to 0.776 and 0.724, and subsequently incorporating multi‐head self‐attention (MHSA) achieved 0.789 and 0.745, highlighting the effectiveness of these architectures in enhancing the backbone structure. Building upon these improvements, applying Masked Autoencoders (MAE) pre‐training and introducing a co‐occurrence loss function yielded final mAP scores of 0.816 on the left and 0.780 on the right, demonstrating the cumulative benefits of these proposed techniques in enhancing detection performance.

Performance of LungSurg on the Classification Task

2.4

For the surgical phase classification task, the classification network (SurgClass) achieved a Top‐1 accuracy of 0.715 and a Top‐3 accuracy of 0.880 on the external validation set (Table 2). The distribution of predicted phases closely mirrored the annotations (Figure 2C,D), demonstrating the model's ability to generalize across different surgical styles and patient anatomies.

Notably, key procedural steps involving bronchovascular management–such as the handling of the pulmonary artery (PA), pulmonary vein (PV), bronchus, and lymph nodes (LNs)–achieved high Top‐1 accuracies ranging from 0.678 to 0.796 and Top‐3 accuracies from 0.804 to 0.988. These results indicate the model's proficiency in recognizing critical phases essential for surgical success and patient safety. Table S5 demonstrates that the model trained on the multi‐center dataset outperforms the single‐center model in both Top 1 and Top 3 classification performance.

To visualize the model's accuracy, we compared the phase classification results with the ground truth for a representative case using a color‐coded ribbon illustration (Figure 5A). The nearly complete agreement between predicted and actual phases underscores the model's effectiveness. Furthermore, we integrated the LungSurg model into a custom‐designed software platform (Figure 5B; Demos S1 and S2), enabling automatic extraction of surgical content and generation of predictive images and phase names with segmentation masks, facilitating practical application in clinical settings.

Visualization and expert‐AI comparison of the LungSurg Model. (A) Phase Classification Result Compared to Ground Truth in a Color‐Coded Ribbon Illustration. The Horizontal Axis Represents Time Progression During Surgery. The Upper Ribbon is the Ground Truth, and the Lower is the Predicted Phase. Different Colors Represent Different Phases Corresponding to Figure 2C and 2D. (B) Designed Software Showing Segmentation and Phase Classification Predicted by LungSurg in Real Surgery. The colored outlines represent the contours of the segmentation masks generated by SurgSeg. (C) Expert‐ LungSurg comparison. AI, Artificial Intelligence; PIC, Precision of Instance Classification; PID, Precision of Instance Detection; SID, sensitivity of instance detection.

An ablation study of the classification model's performance after algorithmic improvements was conducted (Table S6). Transitioning from local mode analysis (8‐s clips) to global mode analysis (entire videos) improved the Top‐1 and Top‐3 accuracies by 8.6% and 3.9%, respectively. Incorporating segmentation masks (SegMask) into the action network increased the overall global Top‐1 and Top‐3 accuracies by 6.1% and 11.3%, respectively. Adding segmentation features to the temporal network further enhanced the classification performance by 1.9% and 6.3%. The final SurgClass model, combining the best features of the action network and temporal network, achieved the highest global Top‐1 and Top‐3 accuracies of 0.757 and 0.929, respectively.

Comparison of LungSurg With Human Experts

2.5

To assess LungSurg's performance relative to human expertise, we conducted a comparison experiment using randomly selected short video segments (1–3 s) from the validation set. Three senior surgeons and five junior surgeons participated, annotating vital anatomical structures such as arteries, veins, and bronchi. Performance metrics were calculated by averaging within each group.

In terms of the PID, LungSurg's detection rate was comparable to that of senior surgeons (PID = 0.634 vs. 0.636). For the PIC, senior surgeons achieved a score of 0.933, while LungSurg scored 0.722, similar to junior surgeons (0.700). This suggests that senior surgeons were more confident in their annotations, possibly due to experience. However, LungSurg exhibited a higher SID of 0.777, outperforming senior surgeons by 0.244, indicating greater sensitivity in detecting anatomical structures (Figure 5C). These findings demonstrate that LungSurg performs at a level comparable to experienced surgeons in anatomical identification.

LungSurg‐Assisted Surgical Education

2.6

In our pilot randomized study, the experimental group (n = 10) trained with LungSurg demonstrated markedly superior performance across all evaluated tasks compared to the control group (n = 10) using conventional methods.

For the anatomical structure identification task, the experimental group achieved a mean accuracy of 73.40% (95% CI, 69.5%–77.3%), significantly higher than the control group's 63.0% (95% CI, 58.8%–67.2%; χ ^2^ = 13.58, p < 0.001). Similarly, in surgical instrument identification, the experimental group's accuracy (88.20%; 95% CI, 85.4%–91.0%) was significantly greater than that of the control group (79.60%; 95% CI, 76.1%–83.1%; χ ^2^ = 13.68, p < 0.001). An improvement was also observed in the surgical phase classification task, where the experimental group achieved an accuracy of 79.2% (95% CI, 75.6%–82.8%), and an increase over the control group's 65.4% (95% CI, 61.2%–69.6%; 23.77, p < 0.001).

Discussion

3

In this multicenter study, we developed and validated LungSurg, an advanced AI system designed to assist in VATS lobectomy. Our findings demonstrate that LungSurg can accurately identify critical anatomical structures and surgical instruments, as well as classify procedural phases. This work represents a significant advancement in the integration of AI into complex surgical procedures, with far‐reaching implications for surgical education, intraoperative assistance, patient outcomes, and the broader landscape of surgical technology.

Advancements in Surgical AI for VATS Lobectomy

3.1

While previous AI applications in surgery have predominantly targeted less complex and more standardized procedures such as laparoscopic cholecystectomy and cataract surgery [16, 17], our work extends this frontier to VATS lobectomy, a domain presenting a distinct set of formidable challenges. Unlike cholecystectomy, which has a relatively predictable anatomy and procedural sequence, VATS lobectomy is characterized by significant inter‐patient anatomical variability, a highly dynamic and non‐rigid surgical field due to lung motion, and the critical need to identify and preserve delicate bronchovascular structures that are often obscured by tissue, smoke, or instruments.

Recent advancements in AI‐driven surgical assistance, such as the integration of deep learning models for decision support, have shown promise in enhancing surgical precision and reducing operative times [18]. LungSurg contributes to this growing body of work by not only segmenting anatomical and instrumental elements but also by accurately classifying surgical phases, thereby providing a comprehensive tool for intraoperative guidance.

To our knowledge, LungSurg is the first comprehensive AI system to successfully tackle both anatomical segmentation and phase classification in this high‐stakes environment. Previous systems like Surgesture [13] and SurgSmart [11] laid crucial groundwork in abdominal surgery, but their architectures were not designed to handle the unique visual and contextual complexities of thoracic surgery. By demonstrating expert‐level performance in VATS lobectomy, our study validates the feasibility of deploying advanced AI in one of the most technically demanding areas of minimally invasive surgery, thereby opening a new avenue for AI‐assisted tools in the thoracic domain.

Generative AI in Specialized Surgical Tasks With Limited Sample Sizes

3.2

A cornerstone of LungSurg's methodological innovation is the pioneering use of a generative pre‐training (GPT) strategy to address the “small data” problem inherent in specialized surgical AI. The development of high‐performance models for procedures like VATS lobectomy is often constrained by the limited availability of expert‐annotated data due to the high cost and labor involved [19]. To overcome this bottleneck, we introduce a self‐supervised approach that leverages a vast corpus of unlabeled surgical videos. Specifically, we employ an MAE to pre‐train our model on a generative task–image inpainting–where the model learns to reconstruct randomly occluded regions of a surgical scene [20] (Figure S4).

This pre‐training regimen forces the model to develop a deep, contextual understanding of thoracic anatomy and the spatial relationships between instruments and tissues, effectively building a foundational “visual knowledge base” without any manual labels. It is crucial to distinguish this approach from generative data augmentation, where models are used to create synthetic images. Instead, our method uses a generative task to learn robust representations from real data. As confirmed by our ablation studies (Table S4), this strategy provides a significantly more powerful feature initialization than conventional ImageNet pre‐training or training from scratch. When subsequently fine‐tuned on our limited annotated dataset, the pre‐trained model achieves superior segmentation accuracy. This demonstrates a powerful and scalable paradigm for developing expert‐level AI in data‐scarce medical domains, marking a significant advancement over the reliance on purely supervised learning in prior surgical AI research [21].

Enhancing Surgical Education and Intraoperative Collaboration

3.3

The educational experiment within our study highlights LungSurg's significant potential to transform surgical training. Surgical education traditionally relies on a combination of didactic learning and hands‐on experience, both of which are limited by factors such as access to mentors and variability in surgical exposure [22]. LungSurg addresses these limitations by offering interactive visualizations and feedback, thereby enhancing the learning experience for surgical residents. The observed improvement in anatomical structure identification and surgical phase classification among residents who utilized LungSurg underscores its efficacy as an educational tool.

Moreover, LungSurg's ability to provide automatic identification of surgical phases and anatomical landmarks can foster better intraoperative collaboration. Effective teamwork and communication are paramount for optimal patient outcomes, particularly in complex surgeries where swift and accurate decision‐making is critical [23]. By keeping all team members informed of the current procedural step and anticipated actions, LungSurg facilitates a more coordinated surgical environment, aiming to enhance overall surgical efficiency.

Insights From Comparative Analysis With Human Experts

3.4

Our comparative analysis between LungSurg and human experts revealed that the AI system performs at a level comparable to senior surgeons in detecting vital anatomical structures. While senior surgeons exhibited higher precision (PIC), LungSurg demonstrated greater sensitivity (SID), suggesting its ability to identify a broader range of relevant features. This balance between precision and sensitivity is crucial in surgical settings, where the identification of all relevant anatomical structures is essential to avoid inadvertent injuries.

The high sensitivity of LungSurg may serve as a valuable adjunct to surgical practice, acting as an additional layer of safety by alerting surgeons to anatomical details that might otherwise be overlooked, especially in high‐stress or fast‐paced operative environments.

Innovative Algorithmic Approaches

3.5

The development of LungSurg incorporated several innovative algorithmic strategies that contribute to its high performance. The use of image inpainting techniques allowed the AI to reconstruct occluded or missing parts of the surgical field, thereby enhancing its ability to interpret complex scenes where instruments or tissue may obscure critical anatomy [24, 25]. This approach is particularly valuable in thoracic surgery, where factors such as smoke, blood, and rapid instrument movements frequently challenge visual clarity.

In addition, the integration of a co‐occurrence loss function based on the probability of object pairs improved the model's contextual understanding [26]. By considering the relationships between surgical instruments and anatomical structures, LungSurg achieved more accurate segmentation and classification within the surgical field. The combination of spatial and temporal networks in the SurgClass module enabled effective classification of surgical phases, capturing both the static and dynamic aspects of the procedure.

These algorithmic innovations not only enhance the current performance of LungSurg but also set a foundation for future AI‐driven advancements in surgical technology. The adaptability of these techniques suggests potential applications beyond VATS lobectomy, extending to other complex surgical procedures where anatomical variability and dynamic intraoperative conditions pose significant challenges.

Limitations and Future Directions

3.6

While LungSurg has demonstrated strong performance, its limitations primarily concern model generalizability and robustness, many of which are inherent to our strategic, phased research design. On a macroscopic level, our initial dataset deliberately excluded highly complex clinical cases involving severe adhesions, significant bleeding, or major anatomical anomalies. This was a methodical choice to first establish a robust baseline on standardized procedures. Consequently, the model's generalizability to these high‐difficulty scenarios remains a key focus for future work, where we will expand our dataset and develop specialized modules using techniques like anomaly detection. On a more granular level, even within standard procedures, the model's robustness can be challenged by adverse visual conditions.

As illustrated by typical failure cases (Figure S5), performance can degrade in scenarios with lens blur, heavy smoke, or rapid instrument movements. Future work will therefore also focus on enhancing model resilience through advanced image processing algorithms and the integration of temporal consistency checks. This dual approach–expanding clinical scope while reinforcing technical resilience–is critical for safely translating surgical AI into the high‐stakes environment of thoracic surgery.

Another major avenue for future research is the transition from an offline analysis tool to a real‐time intraoperative guidance system. This presents several interconnected challenges. First, the current system is not optimized for low latency. Achieving real‐time performance will require significant algorithm optimization through techniques like model pruning, quantization, or knowledge distillation to create a “lightweight” model. Second, effective real‐time integration necessitates seamless interoperability with existing surgical platforms and, critically, the design of intuitive user interfaces that enhance, rather than disrupt, the surgeon's workflow. Addressing these technical and human‐factors challenges through user‐centered design will be paramount for translating our system into a valuable tool for live surgery.

Similarly, our educational validation, while promising, was conducted as a pilot study. Although the results were statistically significant, the sample size was modest, and the participants were from a single institution. Furthermore, this study was limited by its focus on short‐term knowledge acquisition. Future larger‐scale, multi‐center trials are needed to confirm these educational benefits and, crucially, to assess long‐term skill retention and the transfer of these cognitive skills to practical performance in simulated or real operating room environments.

Conclusion

4

We have developed and validated LungSurg, an advanced AI system capable of expert‐level recognition of critical anatomical structures and surgical phases in VATS lobectomy. By integrating innovative techniques such as image inpainting and contextual learning, LungSurg addresses the challenges posed by anatomical variability and intraoperative occlusions inherent in thoracic surgery. Our multicenter study demonstrates the system's robustness, accuracy, and potential to transform surgical education and practice.

Methods

5

Study Design and Dataset Collection

5.1

This multicenter study was conducted across eight centers and approved by the Ethics Committee of the National Center for Respiratory Medicine (IRB number: 2022 NO.70). Given the retrospective nature of the study and the use of de‐identified video data, the requirement for individual patient informed consent was waived by the ethics committee. Our objective was to develop a dual‐function AI network capable of identifying (1) surgical structures (instruments and intrathoracic anatomy) and (2) surgical phases in VATS lobectomy procedures (Figure 1).

Between January 1, 2018, and January 1, 2024, we collected 222 integrated VATS lobectomy videos from eight centers encompassing complete surgical procedures–from endoscope insertion to wound closure. All surgeries were performed by senior thoracic surgeons, and the videos were reviewed by two experts (J.X.H. and W.W.) for quality assurance.

To build robust training and validation datasets, we allocated 165 videos from four centers for AI model development and reserved 57 videos from other centers for independent external validation (Figure 1A). These videos included procedures involving all five lung lobes, ensuring comprehensive coverage of surgical scenarios. To focus on standard surgical practices, we excluded videos featuring unconventional cases such as severe thoracic adhesions, significant intraoperative bleeding, complex lymph node dissections, or advanced procedures like sleeve lobectomy. The video data were collected from various endoscopic systems, including 2D and 3D laparoscopes from multiple manufacturers (e.g., Karl Storz, Stryker, Olympus), ensuring hardware heterogeneity in the dataset.

Video Annotation and Dataset Preparation

5.2

Patient identifiers were removed from all videos to maintain confidentiality. We extracted frames at a rate of one fps, sequentially numbering each frame. To manage the extensive annotation workload and enhance scene diversity, we adjusted the annotation frequency based on scene complexity.

Six certified thoracic surgeons, divided into two groups, performed the annotations. Videos were randomly allocated to each group in a 1:1 ratio. Within each group, annotations were cross‐validated to achieve consensus and minimize subjectivity. Using Labelme software (Massachusetts Institute of Technology, USA) [27], surgeons outlined and labeled surgical instruments and anatomical structures, creating a comprehensive segmentation dataset. Concurrently, we established a phase classification database by recording specific surgical phases along with corresponding start and end frame numbers.

We calculated Cohen's kappa scores between the two groups to assess intergroup annotation consistency and evaluate the impact of subjectivity on data quality. Detailed annotation protocols are provided in the Supporting Information‐Annotation framework for surgical video.

Identification of Surgical Structures and Phases

5.3

We focused on commonly used surgical instruments and relevant anatomical structures in VATS lobectomy. The instruments annotated included staplers, suction devices, clamps, ultrasonic knives, and electrocautery tools (Figure 6A). The anatomical structures included the PA, PV, bronchus, LNs, and azygos vein on the right side, and the aorta on the left side (Figure 6B).

Demonstration of the annotated surgical structures and surgical phase involved in the Video‐assisted Thoracoscopic Lobectomy. (A) Presentation of the instruments. a Electrocoagulation; b Ultrasonic knife; c Stapler; d Suction; e Clamp. (B) Presentation of the thoracic anatomy and surgical areas relative to lobectomy with different phases. LLL, left lower lobe; LN, lymph node; LUL, right upper lobe; PA, pulmonary artery; PV, pulmonary vein; RLL, right lower lobe; RML, right middle lobe; RUL, right upper lobe.

Given the absence of a standardized consensus on VATS lobectomy steps–which can exceed 20 distinct actions–we referred to the essential components outlined by the American Society of Thoracic Surgeons for training simulation and adapted them to our study into 14 actions (Table S7) [3, 28, 29, 30]. The surgical phases primarily included management of pulmonary bronchovascular structures, LNs, pulmonary fissures, and surgical field examination etc. The characteristics of the annotated data for the training set and the validation set are summarizes in Table S1.

AI Model Architecture and Training

5.4

We developed an AI system named LungSurg, featuring two main networks: SurgSeg for surgical structure identification and SurgClass for surgical phase classification. The annotated datasets for surgical structures and phases were used to train SurgSeg and SurgClass, respectively. SurgSeg first analyzes input video clips to generate segmentation masks, which are then utilized by SurgClass for phase classification–integrating both spatial and temporal information (Figure 1B).

SurgSeg Architecture and Training

5.5

To enhance scene diversity despite a finite number of annotations, we employed the image‐based instance segmentation model Mask R‐CNN [31] with a five‐stage ResNet‐50 backbone for SurgSeg. Recognizing that many target objects are slender and irregularly shaped, we incorporated a NL module [32] into the last three stages of the backbone to capture long‐range dependencies and enlarge the receptive field. In addition, inspired by the Bottleneck Transformer architecture, we modified the final stage using the MHSA module [33], further improving segmentation performance.

Innovative Pre‐Training With Image Inpainting

5.6

We introduced an innovative self‐supervised pre‐training strategy inspired by MAE [20] to learn robust feature representations from unlabeled data. In this paradigm, the model is trained on a generative task: reconstructing randomly occluded portions (up to 75%) of real, unlabeled surgical images (Figure S4). We term this “GPT” because the task's objective is inherently generative, forcing the model to develop a deep contextual understanding of surgical scenes–akin to how large language models are generatively pre‐trained–before being fine‐tuned on a specific downstream task. Crucially, this approach is distinct from using generative models (e.g., GANs) to augment datasets with entirely synthetic images. The powerful contextual representations learned during this phase were then fine‐tuned for the primary segmentation task.

To further leverage contextual dependencies between surgical instruments and anatomical structures, we calculated co‐occurrence probabilities between object pairs and introduced a novel co‐occurrence loss function during the fine‐tuning stage (Figure S6). This encouraged the model to consider relational information, enhancing segmentation accuracy.

The final SurgSeg model was trained by combining five loss functions: Region Proposal Network (RPN) loss (smooth L1 loss and binary cross‐entropy loss), Fast R‐CNN loss [34] (co‐occurrence loss and smooth L1 loss), and Mask R‐CNN loss [31] (binary cross‐entropy loss). We used the Adam optimizer with an initial learning rate of 0.0001. During the first 20 epochs, the weights of the pre‐trained encoder were frozen to stabilize training, after which all layers were unfrozen for end‐to‐end optimization. Images were resized to 512 × 512 pixels, and training was conducted with a batch size of eight on Tesla V100 GPUs.

SurgClass Architecture and Training

5.7

To enable SurgClass to effectively utilize knowledge from the segmentation task, we integrated four modalities of information from SurgSeg (Figure 1B):

Per‐pixel confidence scores from SurgSeg.
Dot product results between the original image and SurgSeg's segmentation mask.
Segmentation features from SurgSeg's encoder.
Pre‐trained image features from the encoder are used in pre‐training.

Features 1 and 2 were processed using a 3D action recognition network based on MobileNet_V3 [35], capturing spatial‐temporal patterns. Features 3 and 4 were input into a bidirectional long short‐term memory (bi‐LSTM) network [36] to model temporal dependencies. An MLP classifier then integrated outputs from both networks to produce the final phase classification for each video segment.

SurgClass was trained in two modes:

Local mode: Classifying 4–8‐s video segments to determine the corresponding surgical phase.
Global mode: Performing sliding window analysis across the entire surgery to output temporal segmentation of phases and transition probabilities.

Training utilized the Adam optimizer with a learning rate of 0.0001. If performance plateaued within three epochs, we switched to a multi‐class focal loss function to address class imbalance. Images were resized to 256 × 256 pixels, and training was conducted with a batch size of 16 on four Tesla V100 GPUs.

LungSurg‐Assisted Surgical Education

5.8

To assess LungSurg's educational potential, we conducted a prospective, single‐blind, pilot randomized controlled study. The written informed consent was obtained from all participants.

A total of 20 first‐year surgical residents with no prior specialized training in thoracic surgery were recruited to ensure a homogenous baseline. Participants were randomly assigned (1:1 ratio) via a computer‐generated sequence to either a control group or an experimental group. The assessors who graded the post‐training tests were blinded to the group allocations. Both groups followed a 1‐week structured curriculum (2 h/day). The control group (n = 10) used conventional learning resources (textbooks and standard surgical videos). The experimental group (n = 10) had access to the same resources, supplemented with the interactive LungSurg platform, which provides AI‐powered segmentation overlays and phase classification.

Post‐training performance was evaluated using a standardized, unseen test set. The assessment consisted of two tasks:

Anatomical and Instrument Identification: Participants were shown 50 static surgical screenshots and asked to identify pre‐specified anatomical structures and surgical instruments. An identification was scored as correct only if the object was both correctly labeled and accurately localized. The primary metric was identification accuracy, calculated separately for anatomical structures and instruments. Surgical Phase Classification: Participants viewed fifty 8‐s video clips and identified the corresponding surgical phase from a 14‐option multiple‐choice list. The primary metric was classification accuracy.

The accuracy rates between the two groups were compared using the chi‐squared (χ ^2^) test for independence to analyze the association between group allocation and the correctness of responses. For any analyses where the expected frequency in any cell of the contingency table was less than 5, Fisher's exact test was used instead. Given the pilot nature of this study, results were primarily interpreted based on effect sizes and 95% confidence intervals, with p‐values provided for reference.

Statistical Analysis

5.9

For evaluating SurgSeg's performance, we employed the IoU metric: (Label_Mask ∩ Prediction_Mask) / (Label_Mask U Prediction_Mask). A threshold of IoU > 0.5 was considered correct segmentation. The mAP at IoU > 0.5 was calculated as the average area under the Precision‐Recall curve for each category. SurgClass's performance was assessed using Top‐1 and Top‐3 accuracy metrics for phase classification on each video frame.

All statistical analyses were performed using Python 3.8. Visualizations were created using Matplotlib. Labelme software (Massachusetts Institute of Technology, USA) was used to finish the labeling of the surgical elements.

Author Contributions

H.L. gave conception and design of this study. L.Z., Y.G., and J.H. offered the administrative support. H.L., Z.Y., and Yudong Z. developed the LungSurg system. K.D., H.L., J.S., P.L., J.J., G.Z., X.Z., and H.C. assisted to improve the system. H.L., and Z.Y. conducted experimental analysis. H.Z., Yuzhuo Z., S.L., M.C., X.W., A.R., and W.W. leaded the data collection. H.L., Z.Y., L.Z., Y.G., and J.H. wrote the manuscript. H.L. has accessed and verified the data. H.L., L.Z., Y.G., and J.H. were responsible for the decision to submit the manuscript. All authors read and approved the manuscript.

Funding

This work was supported by the Grant R&D Program of Guangzhou National Laboratory (No. SRPG22‐017), 2025 Engineering Science and Technology Academic Symposium Project of the Chinese Academy of Engineering (2025‐XS‐22), and the National Natural Science Foundation of China (No. 82503973). The study sponsors had no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the paper for publication.

Ethics Statement

This multicenter study was conducted across eight centers and approved by the Ethics Committee of the National Center for Respiratory Medicine (IRB number: 2022 NO.70). Given the retrospective nature of the study and the use of de‐identified video data, the requirement for individual patient informed consent was waived by the ethics committee.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Table S1a: Annotation data characteristics for training set Table S1b: Annotation data characteristics for validation set Table S2 The overall segmentation performance of the LungSurg trained in a single center and validated on the external validation set Table S3: Performance of LungSurg on the External Validation Set, Stratified by Center Table S4: Ablation experiments with the ** SurgSeg ** in internal validation Table S5: The overall classification performance of the ** LungSurg ** trained in a single center on the external validation set. Table S6: Ablation experiment results of ** SurgClass ** (TOP1 and TOP3 ACC for each frame in the internal validation set) Table S7: Definition of each phase of Video‐assisted thoracoscopic (VATS) lobectomy Figure S1: Duration of each surgical phase in the left (A) and right (B) lung Figure S2: Visual assessment of the model's segmentation performance on surgical scenes from external validation centers Figure S3: Ablation study on segmentation task Figure S4: Image filling task in pre‐training Figure S5: Examples of incorrect segmentation and classification Figure S6: Co‐occurrence loss of each annotation in the left and the right VATS lobectomy Supplementary ‐Annotation framework for surgical video

Supporting File 1: mco270613‐sup‐0002‐DemoS1.mp4.

Supporting File 2: mco270613‐sup‐0002‐DemoS2.mp4.

Bibliography36

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1H. Sung , J. Ferlay , R. L. Siegel , et al., “Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries,” CA: A Cancer Journal for Clinicians 71, no. 3 (2021): 209–249.33538338 10.3322/caac.21660 · doi ↗ · pubmed ↗
2T. A. D'Amico , “The Video‐Assisted Thoracoscopic or Open Lobectomy (VIOLET) Trial: The Final Chapter to This Epic,” Journal of Thoracic and Cardiovascular Surgery 166, no. 1 (2023): 265–267.36774208 10.1016/j.jtcvs.2022.12.022 · doi ↗ · pubmed ↗
3T. A. Haidari , L. J. Nayahangan , F. Bjerrum , et al., “Consensus on Technical Procedures for Simulation‐Based Training in Thoracic Surgery: An International Needs Assessment,” European Journal of Cardio‐Thoracic Surgery 63, no. 4 (2023): ezad 058.36808223 10.1093/ejcts/ezad 058 · doi ↗ · pubmed ↗
4A. Vieira , E. Bourdages‐Pageau , K. Kennedy , and P. A. Ugalde , “The Learning Curve on Uniportal Video‐Assisted Thoracic Surgery: An Analysis of Proficiency,” Journal of Thoracic and Cardiovascular Surgery 159, no. 6 (2020): 2487–2495.e 2.31926696 10.1016/j.jtcvs.2019.11.006 · doi ↗ · pubmed ↗
5S. Grossi , M. Cattoni , N. Rotolo , and A. Imperatori , “Video‐Assisted Thoracoscopic Surgery Simulation and Training: A Comprehensive Literature Review,” BMC Medical Education 23, no. 1 (2023): 535.37501111 10.1186/s 12909-023-04482-z PMC 10375656 · doi ↗ · pubmed ↗
6Y. Li , N. Raison , S. Ourselin , T. Mahmoodi , P. Dasgupta , and A. Granados , “AI Solutions for Overcoming Delays in Telesurgery and Telementoring to Enhance Surgical Practice and Education,” Journal of Robotic Surgery 18, no. 1 (2024): 403.39527379 10.1007/s 11701-024-02153-9PMC 11554828 · doi ↗ · pubmed ↗
7D. A. Hashimoto , G. Rosman , D. Rus , and O. R. Meireles , “Artificial Intelligence in Surgery: Promises and Perils,” Annals of Surgery 268, no. 1 (2018): 70–76.29389679 10.1097/SLA.0000000000002693 PMC 5995666 · doi ↗ · pubmed ↗
8T. M. Ward , D. A. Hashimoto , Y. Ban , et al., “Automated Operative Phase Identification in Peroral Endoscopic Myotomy,” Surgical Endoscopy 35, no. 7 (2021): 4008–4015.32720177 10.1007/s 00464-020-07833-9PMC 7854950 · doi ↗ · pubmed ↗