SurgSTU

An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models

1 Institute of Medical Technology and Intelligent Systems, Hamburg University of Technology, Germany
*Corresponding author (lennart.maack@tuhh.de)
DataCV @ CVPR 2026 Workshop

The dataset and benchmark will be made available with the upcoming journal publication.

Abstract

Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets fall short in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large-scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging because manual annotation is costly and generation with large language models is error-prone.

To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 6711 video clips densely extended with 150k fine-grained spatial-temporal question-answer samples.

Our comprehensive evaluation shows that while state-of-the-art generalist VLMs struggle in zero-shot settings, their spatial-temporal capabilities can be improved through in-context learning. A VLM fine-tuned on the SurgSTU training set achieves the highest performance across all spatial-temporal tasks, validating the dataset's efficacy in improving the spatial-temporal understanding of VLMs in surgical videos.

The SurgSTU-Pipeline: surgical videos plus metadata are converted to structured Event Tuples, continuity-filtered, and completed into spatial-temporal QA templates.

The SurgSTU-Pipeline turns surgical videos and their instrument-localization and action-triplet metadata into structured Event Tuples, applies spatial, temporal, and semantic continuity filtering, and completes predefined QA templates. The process is deterministic, with no LLM in the loop.
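
As a rough illustration of these stages, the Python sketch below builds per-frame event records, keeps only temporally continuous runs, and fills a question template. The `EventTuple` fields, the `min_len` threshold, and the helper names are assumptions made for this example; they are not the pipeline's actual schema. Only a temporal continuity check is shown; the spatial and semantic checks are omitted.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class EventTuple:
    # Hypothetical per-frame record; field names are illustrative only.
    frame: int
    instrument: str
    verb: str
    target: str
    bbox: tuple  # (x1, y1, x2, y2) in pixels


def temporally_continuous(events, min_len=3):
    """Keep events that lie in a run of at least `min_len` consecutive
    frames sharing the same (instrument, verb, target)."""
    def key(e):
        return (e.instrument, e.verb, e.target)

    kept = []
    ordered = sorted(events, key=lambda e: (key(e), e.frame))
    for _, group in groupby(ordered, key=key):
        run = []
        for e in group:
            # A gap in frame numbers closes the current run.
            if run and e.frame != run[-1].frame + 1:
                if len(run) >= min_len:
                    kept.extend(run)
                run = []
            run.append(e)
        if len(run) >= min_len:
            kept.extend(run)
    return kept


LOCATE_TEMPLATE = "Locate the {instrument} at {t:.2f} s and provide its bounding box."


def fill_locate(event, fps):
    """Complete the 'locate' template from a single event (no LLM involved)."""
    return {
        "question": LOCATE_TEMPLATE.format(instrument=event.instrument,
                                           t=event.frame / fps),
        "answer": list(event.bbox),
    }
```

Because every step is a pure function of the annotations, running the sketch twice on the same metadata yields identical question-answer pairs, which is the property the word "deterministic" refers to here.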

Question types

Spatial-Temporal Grounding (11 subcategories)

Bbox grounding, trajectories, frame-region classification, reverse grounding, and explicit refusals.
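
Answers to several grounding questions can be derived from bounding boxes alone. The sketch below shows one plausible way to compute a 3×3 frame region, the instrument closest to a query point, and a closer/further/maintained label. It assumes boxes in pixel `(x1, y1, x2, y2)` format, query points normalized to `[0, 1]`, box centres as the reference points, and a tolerance of 5% of the frame diagonal; all of these are assumptions for illustration, not the dataset's documented conventions.

```python
import math


def bbox_center(bbox):
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def grid_region(bbox, width, height, n=3):
    """Return (row, col) of the n x n frame region containing the box centre."""
    cx, cy = bbox_center(bbox)
    col = min(int(cx / width * n), n - 1)
    row = min(int(cy / height * n), n - 1)
    return row, col


def closest_instrument(point_norm, boxes, width, height):
    """Name of the instrument whose box centre is nearest to a normalized point.
    `boxes` maps instrument name -> pixel bbox."""
    px, py = point_norm[0] * width, point_norm[1] * height
    return min(boxes, key=lambda name: math.dist((px, py), bbox_center(boxes[name])))


def distance_change(a_start, b_start, a_end, b_end, frame_diag, tol=0.05):
    """Classify how the centre distance between two instruments changed."""
    d0 = math.dist(bbox_center(a_start), bbox_center(b_start))
    d1 = math.dist(bbox_center(a_end), bbox_center(b_end))
    if abs(d1 - d0) <= tol * frame_diag:
        return "maintained"
    return "closer" if d1 < d0 else "further"
```

With an assumed 854×480 frame, `grid_region((412, 188, 591, 374), 854, 480)` places the box centre in the middle cell, `(1, 1)`.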

  • locate
    "Locate the grasper at 4.20 s and provide its bounding box."
  • window
    "For every instrument visible at 4.20 s, provide a temporal window during which it remains visible."
  • identify_bbox
    "Which instrument is located at the bounding box [412, 188, 591, 374] at 4.20 s?"
  • segment
    "In which of the 3×3 frame regions does the grasper reside at 4.20 s?"
  • closest
    "Which instrument is closest to the coordinates [0.43, 0.58] at 4.20 s?"
  • trajectory
    "Trace the trajectory of the grasper between 4.20 s and 7.80 s as bounding boxes at the leftmost and rightmost extrema."
  • rel_pos
    "Describe the relative position of the grasper with respect to the hook at 4.20 s."
  • rel_change
    "Between 4.20 s and 7.80 s, did the grasper move closer to, further from, or approximately maintain its distance from the hook?"
  • locate_by_action
    "At 4.20 s, provide the bounding box of the instrument performing ‘grasp’."
  • locate_by_target
    "At 4.20 s, locate the instrument interacting with the gallbladder."
  • refusal_absent_instrument
    "Where is the bipolar at 4.20 s?" — answer: not present in this clip.
Multiple Choice (7 subcategories)

Four-way and binary MCQs over instrument classes, presence, and counting — with distractors drawn from the instrument vocabulary.
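
A minimal sketch of how counting answers and four-way options could be derived from per-timestamp visibility lists. The `frames` mapping (timestamp to visible instrument classes), the helper names, and the single-class option format are assumptions for this example; the actual option construction may differ.

```python
import random


def count_distinct(frames, start, end):
    """Number of distinct instrument classes visible in [start, end]."""
    seen = set()
    for t, instruments in frames.items():
        if start <= t <= end:
            seen.update(instruments)
    return len(seen)


def max_concurrent(frames, start, end):
    """Largest number of instruments visible at any single timestamp."""
    return max((len(instruments) for t, instruments in frames.items()
                if start <= t <= end), default=0)


def four_way_options(correct, vocabulary, rng):
    """One correct answer plus three distractors sampled from the vocabulary."""
    distractors = rng.sample([v for v in vocabulary if v != correct], 3)
    options = distractors + [correct]
    rng.shuffle(options)
    return options
```

Passing a seeded `random.Random` instance as `rng` keeps the option order reproducible, which matters if the generated benchmark is meant to be regenerated identically.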

  • Classes
    "Which set of instruments is visible at 4.20 s?"
  • Existence:global
    "Is a bipolar present anywhere in this clip? (Yes / No)"
  • Existence:local_inst
    "Is the grasper present at 4.20 s? (Yes / No)"
  • Existence:local_target
    "Is any instrument interacting with the gallbladder at 4.20 s? (Yes / No)"
  • Counting:distinct
    "How many distinct instrument classes appear between 4.20 s and 9.30 s?"
  • Counting:specific
    "How many graspers are visible at 4.20 s?"
  • Counting:concurrent
    "What is the maximum number of instruments visible simultaneously between 4.20 s and 9.30 s?"
Spatial-Temporal Interaction Captioning (18 subcategories)

What-is-doing-what at a point in time, plus temporal aggregation, ordering, and compositional whole-clip queries.

Bucket A — temporal aggregation

  • interaction_duration
    "For how long does the grasper interact with the gallbladder between 0 s and 14 s?"
  • interaction_count
    "How many times does the grasper perform ‘grasp’ over the clip?"
  • longest_continuous_action
    "What is the longest continuous (grasp, gallbladder) interaction in the clip?"
  • idle_duration
    "What fraction of the clip is the grasper idling?"

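The temporal-aggregation questions above can all be answered from one instrument's per-frame label sequence. The sketch below assumes each frame carries either a `(verb, target)` pair or `None` for idling, and that the sequence covers only frames in which the instrument is visible; this representation and the helper names are illustrative assumptions.

```python
from itertools import groupby


def runs(labels):
    """Collapse per-frame labels into (label, start_index, length) runs."""
    out, start = [], 0
    for label, group in groupby(labels):
        length = len(list(group))
        out.append((label, start, length))
        start += length
    return out


def aggregate(labels, event, fps):
    """Duration, count, longest run, and idle fraction for one instrument."""
    event_runs = [r for r in runs(labels) if r[0] == event]
    total_frames = sum(length for _, _, length in event_runs)
    longest = max((length for _, _, length in event_runs), default=0)
    idle_frames = sum(1 for label in labels if label is None)
    return {
        "interaction_duration_s": total_frames / fps,
        "interaction_count": len(event_runs),
        "longest_continuous_s": longest / fps,
        "idle_fraction": idle_frames / len(labels) if labels else 0.0,
    }
```

Counting runs rather than frames is what separates "how many times" from "for how long": two grasp episodes separated by an idle gap count as two interactions.
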
Bucket B — ordering

  • first_appearance_time
    "At what time does the grasper first appear?"
  • last_appearance_time
    "At what time is the grasper last visible?"
  • first_action
    "What is the first action the grasper performs?"
  • action_sequence
    "List the chronological sequence of (verb, target) pairs the grasper performs."

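For the ordering questions above, a simple scan over per-frame data is enough. This sketch assumes a boolean visibility list and a per-frame label list (`(verb, target)` or `None` for idling) for a single instrument. It also assumes that an action resuming after an idle gap counts as a new entry in the sequence, which is a design choice made here and not something the source specifies.

```python
def first_last_appearance(visible, fps):
    """Return (first, last) visible timestamps in seconds, or None if never visible."""
    indices = [i for i, v in enumerate(visible) if v]
    if not indices:
        return None
    return indices[0] / fps, indices[-1] / fps


def action_sequence(labels):
    """Chronological (verb, target) pairs, one entry per uninterrupted episode."""
    sequence, previous = [], None
    for label in labels:
        if label is not None and label != previous:
            sequence.append(label)
        previous = label
    return sequence


def first_action(labels):
    """First non-idle (verb, target) pair, or None if the instrument only idles."""
    sequence = action_sequence(labels)
    return sequence[0] if sequence else None
```
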
Bucket C — whole-clip MCQ

  • most_active_instrument
    "Which instrument is most active over the whole clip?"
  • most_target_diversity_instrument
    "Which instrument interacts with the most distinct anatomical targets?"
  • dominant_verb
    "What is the dominant verb performed across the clip?"
  • distinct_targets_touched
    "How many distinct anatomical targets does any instrument touch in the clip?"

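The whole-clip questions above reduce to counting over all non-idle events in a clip. In the sketch below, "most active" is taken to mean the instrument with the most non-idle frames and "dominant verb" the most frequent verb; both definitions, and the `(instrument, verb, target)` input format, are assumptions for illustration.

```python
from collections import Counter


def whole_clip_stats(events):
    """`events` is an iterable of (instrument, verb, target), one per non-idle frame."""
    events = list(events)
    activity = Counter(instrument for instrument, _, _ in events)
    verbs = Counter(verb for _, verb, _ in events)
    targets_by_instrument = {}
    for instrument, _, target in events:
        targets_by_instrument.setdefault(instrument, set()).add(target)
    all_targets = {target for _, _, target in events}
    return {
        "most_active_instrument": activity.most_common(1)[0][0] if activity else None,
        "most_target_diversity_instrument": (
            max(targets_by_instrument, key=lambda k: len(targets_by_instrument[k]))
            if targets_by_instrument else None),
        "dominant_verb": verbs.most_common(1)[0][0] if verbs else None,
        "distinct_targets_touched": len(all_targets),
    }
```

Ties are resolved by first occurrence in this sketch; a generator that must produce unambiguous MCQ answers would more likely skip clips where the top two candidates tie.
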
Legacy + refusal

  • target_interaction
    "At 4.20 s, what target is the grasper interacting with?"
  • action_status
    "At 4.20 s, is the grasper actively performing an action or idling?"
  • next_action
    "What action does the grasper perform next after 4.20 s?"
  • comparison
    "Between 4.20 s and 7.80 s, do the grasper and the hook interact with the same target?"
  • reverse_target
    "At 4.20 s, which instrument is interacting with the cystic duct?"
  • refusal_no_specific_action
    "What action is the grasper performing at 4.20 s?" — answer: idling, no specific action.

Acknowledgements

SurgSTU is currently built from publicly available surgical datasets. The cholecystectomy half builds on CholecT50 (Nwoye et al.) for instrument-verb-target triplets and on CholecTrack20 (Nwoye et al.) for instrument bounding boxes. The prostatectomy half builds on ProstaTD for instrument-verb-target annotations; the ProstaTD bounding boxes were annotated internally by the authors.

BibTeX

@inproceedings{maack2026approach,
  title={An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models},
  author={Maack, Lennart and Schlaefer, Alexander},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2945--2954},
  year={2026}
}