Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

Kedi Sun; Chaohui Dang; Yue Feng; James Glasbey; Theodoros N. Arvanitis; Le Zhang

arXiv:2605.11208·cs.CV·May 19, 2026

TL;DR

This paper introduces Hi-GaTA, a hierarchical gated temporal aggregation adapter for surgical video report generation. It is supported by a new benchmark of 214 simulated surgical videos with surgeon-authored reports and by a surgical-specific pretrained video encoder.

Contribution

The paper proposes Hi-GaTA, a lightweight temporal adapter within a new Perception-Alignment-Reasoning framework, and establishes a surgical video report benchmark of 214 simulated videos paired with surgeon-authored evaluation reports, alongside a large pretraining dataset.

Findings

01

Achieves state-of-the-art performance on surgical report generation.

02

Demonstrates the effectiveness of the Hi-GaTA architecture and pretraining strategy.

03

Validates each component through ablation studies.

Abstract

Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder…
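The abstract describes compressing a long sequence of frame features into a small set of prefix tokens through short-to-long-range aggregation. The sketch below is one possible reading of that idea, not the authors' architecture: neighbouring features are merged pairwise with a scalar gate, and the merge is repeated so each surviving token covers a progressively longer time span. The function names, the gate formula, and the weight vector `gate_w` are all hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_merge(a, b, gate_w):
    """Blend two neighbouring feature vectors with a scalar gate.

    The gate (a sigmoid over a weighted feature difference) is an
    assumption for illustration, not the gate used in the paper.
    """
    score = sum(w * (x - y) for w, x, y in zip(gate_w, a, b))
    g = sigmoid(score)
    return [g * x + (1.0 - g) * y for x, y in zip(a, b)]


def hierarchical_aggregate(frames, gate_w, num_tokens):
    """Repeatedly merge adjacent pairs until at most num_tokens remain.

    Each pass roughly halves the sequence, so later passes mix
    information over longer temporal ranges than earlier ones.
    """
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    tokens = [list(f) for f in frames]
    while len(tokens) > num_tokens:
        merged = [
            gated_merge(tokens[i], tokens[i + 1], gate_w)
            for i in range(0, len(tokens) - 1, 2)
        ]
        if len(tokens) % 2 == 1:
            # An unpaired trailing token is carried forward unchanged.
            merged.append(tokens[-1])
        tokens = merged
    return tokens


# Eight toy 2-D "frame features" compressed to two prefix tokens.
frames = [[float(i), 1.0] for i in range(8)]
prefix = hierarchical_aggregate(frames, gate_w=[0.0, 0.0], num_tokens=2)
```

With a zero gate weight the gate is always 0.5, so each merge is a plain average and the two resulting tokens summarise the first and second halves of the clip. In a trained model the gate would be learned, letting the adapter favour the more informative of two neighbouring segments.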
