Surgical Instrument-Tissue Interaction Recognition with Multi-Task-Attention Video Transformer

1 Institute of Medical Technology and Intelligent Systems, Hamburg University of Technology, Germany
2 Martini-Klinik Prostate Cancer Center, University Medical Center Hamburg-Eppendorf, Germany
3 Department of Urology, University Medical Center Hamburg-Eppendorf, Germany

*Corresponding author (lennart.maack@tuhh.de)

Abstract

Purpose The recognition of surgical instrument-tissue interactions can enhance surgical workflow analysis, improve automated safety systems, and enable skill assessment in minimally invasive surgery. However, current deep learning methods for surgical instrument-tissue interaction recognition often rely on static images or coarse temporal sampling, limiting their ability to capture rapid surgical dynamics. Therefore, this study systematically investigates the impact of incorporating fine-grained temporal context into deep learning models for interaction recognition.
Methods We conduct extensive experiments on multiple curated video-based datasets to investigate the influence of fine-grained temporal context on instrument-tissue interaction recognition, using video transformers with spatio-temporal feature extraction capabilities. Additionally, we propose a multi-task-attention module (MTAM) that utilizes cross-attention and a gating mechanism to improve communication between the subtasks of identifying the surgical instrument, atomic action, and anatomical target.
Results Our study demonstrates the benefit of utilizing fine-grained temporal context for the recognition of instrument-tissue interactions, with an optimal sampling rate of 6-8 Hz identified for the examined datasets. Furthermore, our proposed MTAM significantly outperforms state-of-the-art multi-task video transformers on the CholecT45-Vid and GraSP-Vid datasets, achieving relative increases of 4.8% and 5.9% in surgical instrument-tissue interaction recognition, respectively.
Conclusion In this work, we demonstrate the benefits of using fine-grained temporal context rather than static images or coarse temporal context for the task of surgical instrument-tissue interaction recognition. We also show that leveraging cross-attention over spatio-temporal features from the various subtasks leads to improved surgical instrument-tissue interaction recognition performance.
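To make the multi-task-attention idea concrete, the following is a minimal PyTorch sketch of how cross-attention with a gating mechanism between subtask streams could look. It is an illustrative assumption, not the authors' implementation: class name, dimensions, and the gate design (a sigmoid over the concatenated query and cross-task context) are hypothetical.

```python
import torch
import torch.nn as nn


class MultiTaskAttentionModule(nn.Module):
    """Illustrative sketch (not the authors' code): each subtask stream
    (instrument, action, target) attends to the other two subtasks via
    cross-attention, and a learned gate controls how much cross-task
    context is mixed back into the original stream."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, task_feats):
        # task_feats: list of three tensors, each (B, T, dim),
        # one spatio-temporal feature sequence per subtask.
        out = []
        for i, q in enumerate(task_feats):
            # Keys/values are the other subtasks' features, concatenated
            # along the sequence dimension.
            kv = torch.cat([f for j, f in enumerate(task_feats) if j != i], dim=1)
            ctx, _ = self.cross_attn(self.norm(q), kv, kv)
            # Gate decides, per feature, how much cross-task context to inject.
            g = self.gate(torch.cat([q, ctx], dim=-1))
            out.append(q + g * ctx)
        return out


if __name__ == "__main__":
    mtam = MultiTaskAttentionModule(dim=32)
    feats = [torch.randn(2, 5, 32) for _ in range(3)]  # 3 subtask streams
    outs = mtam(feats)
    print([tuple(o.shape) for o in outs])
```

The residual-plus-gate formulation keeps each subtask's own representation dominant while letting the network learn how much information to borrow from the other subtasks.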

BibTeX

BibTeX will be made available soon.