Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation
Kedi Sun, Chaohui Dang, Yue Feng, James Glasbey, Theodoros N. Arvanitis, Le Zhang

TL;DR
This paper introduces Hi-GaTA, a hierarchical temporal adapter for surgical video report generation, leveraging a new benchmark, surgical-specific pretraining, and a novel architecture to improve report quality.
Contribution
The paper proposes Hi-GaTA, a lightweight temporal adapter used within a new Perception-Alignment-Reasoning framework, and establishes a benchmark of 214 simulated surgical videos paired with surgeon-authored evaluation reports, together with a large pretraining dataset.
Findings
Achieves state-of-the-art performance on surgical report generation.
Demonstrates the effectiveness of the Hi-GaTA architecture and pretraining strategy.
Validates each component through ablation studies.
Abstract
Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder…
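To make the idea of short-to-long-range temporal aggregation more concrete, here is a minimal NumPy sketch of how a long sequence of frame features might be compressed into a few prefix tokens. It is an illustration only, not the authors' Hi-GaTA architecture: the names `gated_pool` and `hierarchical_prefix`, the window sizes, the mean/max blend, and the random gate weights (stand-ins for learned parameters) are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_pool(x, window, w_gate):
    """Merge each run of `window` consecutive tokens into one token.

    x:      (T, D) token features, with T divisible by `window`.
    w_gate: (D,) hypothetical gate weights (learned in a real model).
    """
    T, D = x.shape
    groups = x.reshape(T // window, window, D)
    mean = groups.mean(axis=1)               # smooth summary of the window
    peak = groups.max(axis=1)                # most salient activation per channel
    gate = sigmoid(mean @ w_gate)[:, None]   # one scalar gate per merged token
    return gate * mean + (1.0 - gate) * peak

def hierarchical_prefix(frames, windows=(4, 4), seed=0):
    """Apply gated pooling stage by stage: early stages see short ranges,
    later stages operate on already-merged tokens and so cover longer ranges."""
    rng = np.random.default_rng(seed)
    x = frames
    for w in windows:
        # Random weights stand in for trained gate parameters (assumption).
        w_gate = rng.standard_normal(x.shape[1]) * 0.1
        x = gated_pool(x, w, w_gate)
    return x

# 64 frame embeddings of width 32 are reduced to 4 prefix tokens.
frames = np.random.default_rng(1).standard_normal((64, 32))
prefix = hierarchical_prefix(frames)
print(prefix.shape)  # (4, 32)
```

With two stages of window size 4, each output token summarizes 16 consecutive frames, so the token count shrinks by a factor of 16 while the feature width is unchanged. In a real system the resulting tokens would still need to be projected to the language model's embedding size before being used as a visual prefix.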
