# CataractSurg-80K: Knowledge-Driven Benchmarking for Structured Reasoning in Ophthalmic Surgery Planning

**Authors:** Yang Meng, Zewen Pan, Yandi Lu, Ruobing Huang, Yanfeng Liao, Jiarui Yang

arXiv: 2508.20014 · 2025-08-28

## TL;DR

This paper introduces CataractSurg-80K, a large-scale benchmark for cataract surgery planning, and Qwen-CSP, a domain-specific language model designed to improve structured reasoning and decision support in cataract surgery.

## Contribution

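The paper presents what the authors describe as the first large-scale cataract surgery planning benchmark with structured reasoning annotations, a knowledge-driven Multi-Agent System (MAS) that converts raw clinical inputs into structured summaries, and Qwen-CSP, a specialized LLM tailored for ophthalmic clinical decision-making.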

## Key findings

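- Qwen-CSP, built on Qwen-4B and fine-tuned through a multi-stage process, outperforms strong general-purpose LLMs across multiple metrics in the authors' experiments.
- Each benchmark case is annotated with diagnostic questions, expert reasoning chains, and structured surgical recommendations, which supports evaluation of surgical reasoning.
- The dataset and benchmark are intended to facilitate future research in medical AI reasoning and decision support.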

## Abstract

Cataract surgery remains one of the most widely performed and effective procedures for vision restoration. Effective surgical planning requires integrating diverse clinical examinations for patient assessment, intraocular lens (IOL) selection, and risk evaluation. Large language models (LLMs) have shown promise in supporting clinical decision-making. However, existing LLMs often lack the domain-specific expertise to interpret heterogeneous ophthalmic data and provide actionable surgical plans. To enhance the model's ability to interpret heterogeneous ophthalmic reports, we propose a knowledge-driven Multi-Agent System (MAS), where each agent simulates the reasoning process of specialist ophthalmologists, converting raw clinical inputs into structured, actionable summaries in both training and deployment stages. Building on MAS, we introduce CataractSurg-80K, the first large-scale benchmark for cataract surgery planning that incorporates structured clinical reasoning. Each case is annotated with diagnostic questions, expert reasoning chains, and structured surgical recommendations. We further introduce Qwen-CSP, a domain-specialized model built on Qwen-4B, fine-tuned through a multi-stage process tailored for surgical planning. Comprehensive experiments show that Qwen-CSP outperforms strong general-purpose LLMs across multiple metrics. Our work delivers a high-quality dataset, a rigorous benchmark, and a domain-adapted LLM to facilitate future research in medical AI reasoning and decision support.
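The abstract describes two structures that a short sketch can make concrete: a case annotated with diagnostic questions, a reasoning chain, and a structured recommendation, and specialist agents that turn raw reports into structured summaries. The Python below is only an illustration under assumptions. The class `CaseRecord`, its field names, the `biometry_agent` parser, the `summarize` router, and the `key=value` report format are all hypothetical and are not taken from the paper or its released data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CaseRecord:
    """Hypothetical shape of one benchmark case (field names are assumptions)."""
    case_id: str
    diagnostic_questions: List[str]
    reasoning_chain: List[str]
    surgical_recommendation: Dict[str, str]


# A specialist agent maps one raw report (free text) to a structured summary.
Agent = Callable[[str], Dict[str, str]]


def biometry_agent(report: str) -> Dict[str, str]:
    """Toy stand-in for a specialist agent: parses 'key=value' pairs split by ';'."""
    pairs = (item.split("=", 1) for item in report.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def summarize(reports: Dict[str, str], agents: Dict[str, Agent]) -> Dict[str, Dict[str, str]]:
    """Route each raw report to the agent registered for its examination type.

    Reports with no registered agent are skipped rather than guessed at.
    """
    return {exam: agents[exam](text) for exam, text in reports.items() if exam in agents}
```

In this sketch a real system would replace `biometry_agent` with an LLM-backed specialist; the routing dictionary only shows how per-examination outputs could be collected into one structured summary before a planning model sees them.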

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20014/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20014/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/2508.20014/full.md

---
Source: https://tomesphere.com/paper/2508.20014