SurgSTU: A Deterministic Pipeline for Spatial-Temporal Surgical QA

Dataset Examples

Spatial-Temporal Grounding · Identify by BBox · prosta

Q: At 21.80 s, which surgical instrument is positioned within the bounding box {"bbox_2d": [675.8, 382.0, 980.5, 1000.0]} in normalized [0, 1000] format?

A: The instrument located at these coordinates is the aspirator.

Metric: Exact-match (instrument name)
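The normalized [0, 1000] box format used in the question can be derived from pixel coordinates with a small helper. The sketch below assumes boxes are given as [x1, y1, x2, y2] in pixels and rounds to one decimal as in the example. The function names and the case-insensitive matching rule are illustrative assumptions, not the authors' code.

```python
def normalize_bbox(bbox_px, width, height, scale=1000.0):
    """Map a pixel-space [x1, y1, x2, y2] box to the [0, scale] range."""
    x1, y1, x2, y2 = bbox_px
    return [
        round(x1 / width * scale, 1),
        round(y1 / height * scale, 1),
        round(x2 / width * scale, 1),
        round(y2 / height * scale, 1),
    ]


def exact_match(predicted, reference):
    """Compare instrument names, ignoring case and surrounding whitespace.

    The normalization rule is an assumption; the source only names the metric.
    """
    return predicted.strip().lower() == reference.strip().lower()
```

For a 1920×1080 frame, `normalize_bbox([960, 540, 1920, 1080], 1920, 1080)` yields `[500.0, 500.0, 1000.0, 1000.0]`.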

Spatial-Temporal Interaction Captioning · Reverse Target · prosta

Q: Are there any instruments actively interacting with the target bladder between 15.03 s and 17.03 s? If so, which ones?

A: Yes, the following instruments are interacting with bladder: forceps, scissors.

Metric: Set F1 (instrument set)
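Set F1 treats the predicted and reference instrument names as unordered sets and combines precision and recall. A minimal sketch follows; scoring two empty sets as 1.0 is an assumption, since the source does not say how that case is handled.

```python
def set_f1(predicted, reference):
    """F1 score between two collections, compared as sets."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # assumption: both empty counts as a perfect match
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A prediction of only `["forceps"]` against the reference `["forceps", "scissors"]` has precision 1.0 and recall 0.5, giving an F1 of about 0.67.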

Spatial-Temporal Grounding · Trajectory · prosta

Q: Trace the trajectory of the surgical instrument scissors between 7.73 s and 15.83 s and provide the spatiotemporal coordinates for its most extreme top and bottom positions in normalized [0, 1000] format.

A: Extreme Top: {"bbox_2d": [581.0, 251.2, 909.2, 515.6]} at t=7.73 s; Extreme Bottom: {"bbox_2d": [518.5, 460.2, 735.3, 720.3]} at t=15.83 s.

Metric: Spatiotemporal error + center distance
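One way to obtain such extrema from a tracked instrument is to compare box centers along the vertical axis. The sketch below assumes image coordinates in which y grows downward, which is consistent with the example answer, and assumes that "extreme" refers to the box center; the source does not specify either. The center-distance helper is likewise only one plausible reading of the metric name.

```python
import math


def center(bbox):
    """Center (cx, cy) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def vertical_extremes(track):
    """Return the (time_s, bbox) samples with the topmost and bottommost centers.

    track: list of (time_s, bbox) samples for one instrument.
    Assumes y grows downward, so the smallest cy is the top.
    """
    top = min(track, key=lambda sample: center(sample[1])[1])
    bottom = max(track, key=lambda sample: center(sample[1])[1])
    return top, bottom


def center_distance(bbox_a, bbox_b):
    """Euclidean distance between two box centers."""
    (ax, ay), (bx, by) = center(bbox_a), center(bbox_b)
    return math.hypot(ax - bx, ay - by)
```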

Spatial-Temporal Interaction Captioning · Next Action + Target · prosta

Q: Following the completion of idling by the forceps, identify both the subsequent verb AND the subsequent target.

A: After idling, the forceps transitions to retract on the bladder.

Metric: Exact-match (verb + target)
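A next-action answer like this can be read off a chronological list of (verb, target) segments for one instrument. The sketch below is an illustration under that assumed input layout; the function name is hypothetical.

```python
def next_action(segments, completed_verb):
    """Return the first (verb, target) that follows a run of completed_verb.

    segments: chronological list of (verb, target) pairs for one instrument.
    Returns None if the verb never occurs or nothing follows it.
    """
    seen = False
    for verb, target in segments:
        if verb == completed_verb:
            seen = True
        elif seen:
            return verb, target
    return None
```

With segments `[("idling", None), ("retract", "bladder")]`, `next_action(segments, "idling")` returns `("retract", "bladder")`, matching the answer above.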

Spatial-Temporal Grounding · Relative Change · prosta

Q: Between 21.20 s and 25.73 s, what is the observed change in the spatial relationship between the grasper and the aspirator?

A: The grasper approximately maintained its distance from the aspirator (distance change 0.02; below the 0.30 significance threshold).

Metric: Relation accuracy + numeric relative error
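The thresholded decision in this answer can be sketched as follows. The source gives the 0.30 threshold and the three outcomes but not how the distance change is computed; the code assumes it is the relative change of the inter-instrument distance between the two timestamps, which is a guess.

```python
import math


def classify_distance_change(d_start, d_end, threshold=0.30):
    """Label how the distance between two instruments changed.

    Assumption: change = (d_end - d_start) / d_start, compared against
    the significance threshold from the example answer.
    """
    if d_start == 0:
        change = 0.0 if d_end == 0 else math.inf
    else:
        change = (d_end - d_start) / d_start
    if abs(change) < threshold:
        label = "maintained"
    elif change < 0:
        label = "closer"
    else:
        label = "further"
    return label, change
```

Under this assumption, a distance that grows from 100.0 to 102.0 is a change of 0.02 and is labeled "maintained", as in the example.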

Multi Choice · Counting (concurrent) · prosta

Q: What is the highest concurrent tool count observed throughout the segment spanning from 20.63 s to 27.17 s? Options: A) 4 B) 3 C) 5 D) 0

A: Answer: B) 3

Metric: Exact-match accuracy
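The concurrent count is the largest number of instruments visible in any single frame of the window. A minimal sketch, assuming per-frame annotations are available as (time, list of instruments) pairs; this data layout is hypothetical.

```python
def max_concurrent(frames, start_s, end_s):
    """Largest number of instruments visible in one frame within [start_s, end_s].

    frames: list of (time_s, instruments) pairs; instruments is a list so that
    several instances of the same class are counted separately.
    """
    counts = [
        len(instruments)
        for time_s, instruments in frames
        if start_s <= time_s <= end_s
    ]
    return max(counts, default=0)
```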

Abstract

Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets fall short in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large-scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging because manual annotation is costly and generation with large language models is error-prone.

To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 6711 video clips densely extended with 150k fine-grained spatial-temporal question-answer samples.

Our comprehensive evaluation shows that while state-of-the-art generalist VLMs struggle in zero-shot settings, their spatial-temporal capabilities can be improved through in-context learning. A VLM fine-tuned on the SurgSTU training set achieves the highest performance across all spatial-temporal tasks, validating the dataset's efficacy in improving the spatial-temporal understanding of VLMs in surgical videos.

The SurgSTU-Pipeline: surgical videos plus metadata are converted to structured Event Tuples, continuity-filtered, and completed into spatial-temporal QA templates.

The SurgSTU-Pipeline turns surgical videos and their instrument-localization / action-triplet metadata into structured Event Tuples, applies spatial / temporal / semantic continuity filtering, and completes predefined QA templates. Every step is deterministic, with no LLM in the loop.
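The three stages can be illustrated with a small sketch. Everything here is hypothetical: the `EventTuple` fields, the gap-based continuity criterion, its 1.0 s default, and the answer template are assumptions made for illustration, since the page does not publish the actual schema or filter rules. Only the question template is taken from the examples on this page.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EventTuple:
    """Hypothetical record: one instrument observation at one timestamp."""
    time_s: float
    instrument: str
    verb: str
    target: Optional[str]
    bbox: Tuple[float, float, float, float]  # normalized [0, 1000]


def is_temporally_continuous(events: List[EventTuple], max_gap_s: float = 1.0) -> bool:
    """Assumed criterion: no gap between consecutive samples exceeds max_gap_s."""
    times = sorted(event.time_s for event in events)
    return all(later - earlier <= max_gap_s for earlier, later in zip(times, times[1:]))


QUESTION_TEMPLATE = "At {time:.2f} s, what target is the {instrument} interacting with?"
ANSWER_TEMPLATE = "The {instrument} is interacting with the {target}."  # assumed wording


def complete_template(event: EventTuple) -> Tuple[str, str]:
    """Fill fixed templates from one event; the same input always yields the same QA pair."""
    question = QUESTION_TEMPLATE.format(time=event.time_s, instrument=event.instrument)
    answer = ANSWER_TEMPLATE.format(instrument=event.instrument, target=event.target)
    return question, answer
```

Because the templates are filled by plain string formatting, the output is fully reproducible, which is the property the pipeline description emphasizes.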

Question types

Spatial-Temporal Grounding (11 subcategories)

Bbox grounding, trajectories, frame-region classification, reverse grounding, and explicit refusals.

locate
"Locate the grasper at 4.20 s and provide its bounding box."
window
"For every instrument visible at 4.20 s, provide a temporal window during which it remains visible."
identify_bbox
"Which instrument is located at the bounding box [412, 188, 591, 374] at 4.20 s?"
segment
"In which of the 3×3 frame regions does the grasper reside at 4.20 s?"
closest
"Which instrument is closest to the coordinates [0.43, 0.58] at 4.20 s?"
trajectory
"Trace the trajectory of the grasper between 4.20 s and 7.80 s as bounding boxes at the leftmost and rightmost extrema."
rel_pos
"Describe the relative position of the grasper with respect to the hook at 4.20 s."
rel_change
"Between 4.20 s and 7.80 s, did the grasper move closer to, further from, or approximately maintain its distance from the hook?"
locate_by_action
"At 4.20 s, provide the bounding box of the instrument performing ‘grasp’."
locate_by_target
"At 4.20 s, locate the instrument interacting with the gallbladder."
refusal_absent_instrument
"Where is the bipolar at 4.20 s?" — answer: not present in this clip.
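The `segment` and `closest` subcategories reduce to simple geometry on box centers. The sketch below assumes boxes and query points share the same coordinate system (here normalized [0, 1000]) and that region membership is decided by the box center; the region names are hypothetical labels.

```python
import math

REGION_NAMES = [
    ["top-left", "top-center", "top-right"],
    ["middle-left", "center", "middle-right"],
    ["bottom-left", "bottom-center", "bottom-right"],
]


def box_center(bbox):
    """Center (cx, cy) of an [x1, y1, x2, y2] box."""
    return ((bbox[0] + bbox[2]) / 2.0, (bbox[1] + bbox[3]) / 2.0)


def frame_region(bbox, extent=1000.0):
    """Name of the 3x3 grid cell that contains the box center."""
    cx, cy = box_center(bbox)
    col = min(int(cx / extent * 3), 2)  # clamp so cx == extent stays in the last column
    row = min(int(cy / extent * 3), 2)
    return REGION_NAMES[row][col]


def closest_instrument(point, boxes):
    """Instrument whose box center is nearest to point.

    boxes: dict mapping instrument name to [x1, y1, x2, y2].
    """
    def distance(name):
        cx, cy = box_center(boxes[name])
        return math.hypot(cx - point[0], cy - point[1])

    return min(boxes, key=distance)
```

For the box `[412, 188, 591, 374]` from the `identify_bbox` example, the center is (501.5, 281.0), so `frame_region` returns `"top-center"`.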

Multi Choice (7 subcategories)

Four-way and binary MCQs over instrument classes, presence, and counting — with distractors drawn from the instrument vocabulary.

Classes
"Which set of instruments is visible at 4.20 s?"
Existence:global
"Is a bipolar present anywhere in this clip? (Yes / No)"
Existence:local_inst
"Is the grasper present at 4.20 s? (Yes / No)"
Existence:local_target
"Is any instrument interacting with the gallbladder at 4.20 s? (Yes / No)"
Counting:distinct
"How many distinct instrument classes appear between 4.20 s and 9.30 s?"
Counting:specific
"How many graspers are visible at 4.20 s?"
Counting:concurrent
"What is the maximum number of instruments visible simultaneously between 4.20 s and 9.30 s?"
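Drawing distractors from the instrument vocabulary can be kept reproducible by using a seeded random generator. The sketch below is one possible construction, not the authors' procedure; the function name, the seed handling, and the option lettering are assumptions.

```python
import random
import string


def build_mcq(correct, vocabulary, num_options=4, seed=0):
    """Build a multiple-choice question with distractors from the vocabulary.

    Returns (options, answer_letter), where options maps letters to names.
    Raises ValueError if the vocabulary has too few other entries.
    """
    rng = random.Random(seed)  # fixed seed keeps generation reproducible
    pool = sorted(set(vocabulary) - {correct})
    distractors = rng.sample(pool, num_options - 1)
    choices = distractors + [correct]
    rng.shuffle(choices)
    letters = string.ascii_uppercase[:num_options]
    options = dict(zip(letters, choices))
    answer_letter = letters[choices.index(correct)]
    return options, answer_letter
```

Calling `build_mcq("grasper", vocabulary)` twice with the same seed returns the same options and answer letter, so the resulting dataset does not change between runs.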

Spatial-Temporal Interaction Captioning (18 subcategories)

What-is-doing-what at a point, plus temporal aggregation, ordering, and compositional whole-clip queries.

Bucket A — temporal aggregation

interaction_duration
"For how long does the grasper interact with the gallbladder between 0 s and 14 s?"
interaction_count
"How many times does the grasper perform ‘grasp’ over the clip?"
longest_continuous_action
"What is the longest continuous (grasp, gallbladder) interaction in the clip?"
idle_duration
"What fraction of the clip is the grasper idling?"
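The temporal aggregation questions above can all be answered from a list of action segments for one instrument. The sketch below assumes segments of the form (start_s, end_s, verb, target) with idle periods labeled "idling"; this layout is hypothetical.

```python
def interaction_duration(segments, target, window_start, window_end):
    """Total seconds the instrument interacts with target inside the window."""
    total = 0.0
    for start, end, _verb, seg_target in segments:
        if seg_target != target:
            continue
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            total += overlap
    return total


def interaction_count(segments, verb):
    """Number of separate segments in which the verb is performed."""
    return sum(1 for _start, _end, seg_verb, _target in segments if seg_verb == verb)


def longest_continuous_action(segments, verb, target):
    """Duration of the longest single (verb, target) segment, or 0.0 if none."""
    durations = [
        end - start
        for start, end, seg_verb, seg_target in segments
        if seg_verb == verb and seg_target == target
    ]
    return max(durations, default=0.0)


def idle_fraction(segments, clip_length_s):
    """Fraction of the clip spent in segments labeled 'idling'."""
    idle = sum(end - start for start, end, verb, _target in segments if verb == "idling")
    return idle / clip_length_s
```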

Bucket B — ordering

first_appearance_time
"At what time does the grasper first appear?"
last_appearance_time
"At what time is the grasper last visible?"
first_action
"What is the first action the grasper performs?"
action_sequence
"List the chronological sequence of (verb, target) pairs the grasper performs."
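The ordering questions follow from per-frame annotations once consecutive duplicates are collapsed. A minimal sketch, assuming a chronological list of per-frame (verb, target) labels and a list of timestamps at which the instrument is visible; both inputs are hypothetical layouts.

```python
from itertools import groupby


def first_last_appearance(visible_times):
    """Earliest and latest timestamps at which the instrument is visible."""
    return min(visible_times), max(visible_times)


def action_sequence(frame_labels):
    """Collapse consecutive identical (verb, target) labels into one entry each.

    frame_labels: chronological list of (verb, target) pairs for one instrument.
    """
    return [label for label, _run in groupby(frame_labels)]


def first_action(frame_labels, idle_verb="idling"):
    """First non-idle (verb, target) pair, or None if the instrument only idles."""
    for verb, target in action_sequence(frame_labels):
        if verb != idle_verb:
            return verb, target
    return None
```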

Bucket C — whole-clip MCQ

most_active_instrument
"Which instrument is most active over the whole clip?"
most_target_diversity_instrument
"Which instrument interacts with the most distinct anatomical targets?"
dominant_verb
"What is the dominant verb performed across the clip?"
distinct_targets_touched
"How many distinct anatomical targets does any instrument touch in the clip?"
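The whole-clip queries amount to counting over all annotated events. The sketch below assumes events are (instrument, verb, target) triples and defines "most active" as the largest number of non-idle events; that definition is an assumption, as the page does not state one.

```python
from collections import Counter, defaultdict

IDLE = "idling"


def most_active_instrument(events):
    """Instrument with the most non-idle events (assumed definition of activity)."""
    counts = Counter(inst for inst, verb, _target in events if verb != IDLE)
    return counts.most_common(1)[0][0] if counts else None


def dominant_verb(events):
    """Most frequent non-idle verb across all instruments."""
    counts = Counter(verb for _inst, verb, _target in events if verb != IDLE)
    return counts.most_common(1)[0][0] if counts else None


def most_target_diversity_instrument(events):
    """Instrument that interacts with the largest number of distinct targets."""
    targets = defaultdict(set)
    for inst, _verb, target in events:
        if target is not None:
            targets[inst].add(target)
    if not targets:
        return None
    return max(targets, key=lambda inst: len(targets[inst]))


def distinct_targets_touched(events):
    """Number of distinct targets touched by any instrument."""
    return len({target for _inst, _verb, target in events if target is not None})
```

Ties are resolved by first occurrence in this sketch; a production pipeline would need an explicit tie-breaking rule to keep MCQ answers unambiguous.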

Legacy + refusal

target_interaction
"At 4.20 s, what target is the grasper interacting with?"
action_status
"At 4.20 s, is the grasper actively performing an action or idling?"
next_action
"What action does the grasper perform next after 4.20 s?"
comparison
"Between 4.20 s and 7.80 s, do the grasper and the hook interact with the same target?"
reverse_target
"At 4.20 s, which instrument is interacting with the cystic duct?"
refusal_no_specific_action
"What action is the grasper performing at 4.20 s?" — answer: idling, no specific action.

Acknowledgements

SurgSTU currently builds on publicly available surgical datasets. The cholecystectomy half uses CholecT50 (Nwoye et al.) for instrument-verb-target triplets and CholecTrack20 (Nwoye et al.) for instrument bounding boxes. The prostatectomy half uses ProstaTD for instrument-verb-target annotations; its bounding boxes were annotated internally by the authors.

BibTeX

@inproceedings{maack2026approach,
  title={An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models},
  author={Maack, Lennart and Schlaefer, Alexander},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2945--2954},
  year={2026}
}